EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable
Text-to-Speech via Activation Steering
Anonymous Authors
Abstract
Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only
coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed
emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These
models also require extensive, high-quality datasets for training. To address these limitations, we
propose EmoSteer-TTS, a novel training-free approach, to achieve
fine-grained speech emotion control (conversion, interpolation, erasure) by
activation steering. We first empirically observe that modifying a subset of the internal
activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized
speech. Building on this insight, we then develop a training-free and efficient algorithm, including
activation extraction, emotional token searching, and inference-time steering, which can be seamlessly
integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to
derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers.
Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous
control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this
is the first method that achieves training-free and continuous fine-grained emotion control in TTS.
Note: The content on this page is for research purposes only.
Method Framework
Our EmoSteer-TTS approach enables training-free and fine-grained emotion control through activation steering:
Figure: Overview of the proposed EmoSteer-TTS. Emotion steering vectors and steering weights are
derived from pairs of
neutral and emotional reference speech samples. During inference, these vectors are used to modulate
the activations of the
TTS model, effectively guiding it to synthesize speech that reflects the desired emotion.
Contributions
We present the first fine-grained and training-free emotion-controllable TTS (EC-TTS) approach by
identifying and modulating internal emotion representations within existing TTS models.
We provide new insights and enhanced interpretability
for continuous EC-TTS by uncovering the emotion steering dynamics in pretrained TTS models, offering
practical guidance for the design of the proposed algorithm.
Extensive objective and subjective evaluations demonstrate the effectiveness of EmoSteer-TTS in
fine-grained speech emotion control, showing its potential applicability across a wide range of
pretrained TTS models.
Training-Free Emotion Conversion
Emotion conversion supports six basic emotions: anger, disgust, fear, happiness, sadness, and surprise.
Target Emotion: Anger
Method
Reference Audio (Neutral)
Text
Synthesized ($\alpha$=2.0)
F5-TTS + EmoSteer-TTS
我都快气疯了你怎么还这么冷静
E2-TTS + EmoSteer-TTS
You had one job and you still managed to screw it up.
CosyVoice2 + EmoSteer-TTS
Completely ridiculous. I can't stay quiet anymore!
Target Emotion: Disgust
Method
Reference Audio (Neutral)
Text
Synthesized ($\alpha$=2.0)
F5-TTS + EmoSteer-TTS
这种人我听着都觉得恶心
E2-TTS + EmoSteer-TTS
It's disgusting how fake people can be in these discussions.
CosyVoice2 + EmoSteer-TTS
That arrogant attitude is absolutely revolting.
Target Emotion: Fear
Method
Reference Audio (Neutral)
Text
Synthesized ($\alpha$=2.0)
F5-TTS + EmoSteer-TTS
这些照片看着怪怪的让我浑身发冷
E2-TTS + EmoSteer-TTS
Something about this doesn't feel right at all.
CosyVoice2 + EmoSteer-TTS
I wish I could pretend I'm not frightened, but I am.
Target Emotion: Happiness
Method
Reference Audio (Neutral)
Text
Synthesized ($\alpha$=2.0)
F5-TTS + EmoSteer-TTS
今天阳光真好一切都让人心情愉快
E2-TTS + EmoSteer-TTS
I feel so grateful and cheerful right now.
CosyVoice2 + EmoSteer-TTS
That news just made my day. I can't stop smiling.
Target Emotion: Sadness
Method
Reference Audio (Neutral)
Text
Synthesized ($\alpha$=2.0)
F5-TTS + EmoSteer-TTS
我努力了,却还是让人失望了
E2-TTS + EmoSteer-TTS
Who is repeating all that hard stuff to you?
CosyVoice2 + EmoSteer-TTS
I wish conversations like this didn't remind me of what's missing.
Target Emotion: Surprise
Method
Reference Audio (Neutral)
Text
Synthesized ($\alpha$=2.0)
F5-TTS + EmoSteer-TTS
我简直不敢相信会发生这种事
E2-TTS + EmoSteer-TTS
That's surprising. I had no idea he wrote so much!
CosyVoice2 + EmoSteer-TTS
I'm honestly shocked by how much that covers.
Training-Free Emotion Interpolation
EmoSteer-TTS can also interpolate between a neutral reference and a target emotion to create a smooth
transition:
Target Emotion: Anger
Reference Audio (Neutral)
Text
Steering Intensity ($\alpha$)
Synthesized
F5-TTS + EmoSteer-TTS
你到底能不能小心点?每次都出问题!
0.0 (Neutral)
0.5
1.0
1.5
2.0
E2-TTS + EmoSteer-TTS
This is absolutely unacceptable!
0.0 (Neutral)
0.5
1.0
1.5
2.0
CosyVoice2 + EmoSteer-TTS
How many times do I have to tell you.
0.0 (Neutral)
0.5
1.0
1.5
2.0
Target Emotion: Disgust
Reference Audio (Neutral)
Text
Steering Intensity ($\alpha$)
Synthesized
F5-TTS + EmoSteer-TTS
这种天气真是让人讨厌到极点。
0.0 (Neutral)
0.5
1.0
1.5
2.0
E2-TTS + EmoSteer-TTS
That's just disgusting.I don't even want to think about it.
0.0 (Neutral)
0.5
1.0
1.5
2.0
CosyVoice2 + EmoSteer-TTS
I can't stand being around this kind of behavior.
0.0 (Neutral)
0.5
1.0
1.5
2.0
Target Emotion: Fear
Reference Audio (Neutral)
Text
Steering Intensity ($\alpha$)
Synthesized
F5-TTS + EmoSteer-TTS
这声音太诡异了,我不敢一个人待在这。
0.0 (Neutral)
0.5
1.0
1.5
2.0
E2-TTS + EmoSteer-TTS
Something about this doesn't feel right at all.
0.0 (Neutral)
0.5
1.0
1.5
2.0
CosyVoice2 + EmoSteer-TTS
I couldn't sleep at all. I was too anxious.
0.0 (Neutral)
0.5
1.0
1.5
2.0
Target Emotion: Happiness
Reference Audio (Neutral)
Text
Steering Intensity ($\alpha$)
Synthesized
F5-TTS + EmoSteer-TTS
不过现在天气放晴了,心情也跟着亮了!
0.0 (Neutral)
0.5
1.0
1.5
2.0
E2-TTS + EmoSteer-TTS
I feel so grateful and cheerful right now!
0.0 (Neutral)
0.5
1.0
1.5
2.0
CosyVoice2 + EmoSteer-TTS
We've been laughing all day. It's been wonderful!
0.0 (Neutral)
0.5
1.0
1.5
2.0
Target Emotion: Sadness
Reference Audio (Neutral)
Text
Steering Intensity ($\alpha$)
Synthesized
F5-TTS + EmoSteer-TTS
天气恶劣让我感到无助和害怕。
0.0 (Neutral)
0.5
1.0
1.5
2.0
E2-TTS + EmoSteer-TTS
The squire himself showed perfect.
0.0 (Neutral)
0.5
1.0
1.5
2.0
CosyVoice2 + EmoSteer-TTS
Take courage all isn't lost yet.
0.0 (Neutral)
0.5
1.0
1.5
2.0
Target Emotion: Surprise
Reference Audio (Neutral)
Text
Steering Intensity ($\alpha$)
Synthesized
F5-TTS + EmoSteer-TTS
天哪,居然需要重新做一遍
0.0 (Neutral)
0.5
1.0
1.5
2.0
E2-TTS + EmoSteer-TTS
I can't believe what I've just heard.
0.0 (Neutral)
0.5
1.0
1.5
2.0
CosyVoice2 + EmoSteer-TTS
Wow! I didn't expect there to be so few activists!
0.0 (Neutral)
0.5
1.0
1.5
2.0
The Emotion Interpolation Ability of Training-Based Baseline Methods
All the samples are grabbed from their official demo pages.
EmoSphere++ (Cho et al. 2025) offers intensity control over four emotions, i.e., Anger,
Happiness, Sadness, and Surprise.
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Anger
Weak
Happiness
Weak
Sadness
Weak
Surprise
Weak
Medium
Medium
Medium
Medium
Strong
Strong
Strong
Strong
HED-TTS (Inoue et al. 2025) offers intensity control over four emotions, i.e., Anger,
Happiness, Sadness, and Surprise.
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Anger
0.0
Happiness
0.0
Sadness
0.0
Surprise
0.0
0.4
0.4
0.4
0.4
0.6
0.6
0.6
0.6
1.0
1.0
1.0
1.0
EmoDubber (Cong et al. 2025) offers intensity control over three emotions, i.e., Anger,
Sadness, and Surprise.
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Emotion
Intensity
Synthesized
Anger
Weak
Sadness
Zero
Surprise
Zero
Medium
Strong
Strong
Strong
Training-Free Emotion Erasure
EmoSteer-TTS can also erase the emotion of a reference speech to create a neutral speech:
Emotion to Erase: Anger
Method
Before Erasure (($\beta$=0.0))
Text
After Erasure ($\beta$=2.5)
F5-TTS + EmoSteer-TTS
我觉得并不是非要有特别的品质
E2-TTS + EmoSteer-TTS
What am I? I'm a racer, son of God!
CosyVoice2 + EmoSteer-TTS
Hurry up, hurry up!
Emotion to Erase: Disgust
Method
Before Erasure (($\beta$=0.0))
Text
After Erasure ($\beta$=2.5)
F5-TTS + EmoSteer-TTS
It doesn't make any sense. But then again, what in the government does make sense?
E2-TTS + EmoSteer-TTS
Oh yeah. Yeah, no, there's a horrifying moment. I think the guy's from Time or something. I
tried to interview him. It's crass. It's in bad.
CosyVoice2 + EmoSteer-TTS
Oh, it looks so terrible. I don't think I want this, but it's...
Emotion to Erase: Fear
Method
Before Erasure (($\beta$=0.0))
Text
After Erasure ($\beta$=2.5)
F5-TTS + EmoSteer-TTS
So there are huge amounts of danger. I'm very worried because... Continue to share important
stuff with me on Facebook.
E2-TTS + EmoSteer-TTS
Maybe it's old school, but I don't see how that's any of my business. Okay.
CosyVoice2 + EmoSteer-TTS
That world. You're going to hide inside yourself, crying. You'll flood as much of the world to
drown me.
Emotion to Erase: Happiness
Method
Before Erasure (($\beta$=0.0))
Text
After Erasure ($\beta$=2.5)
F5-TTS + EmoSteer-TTS
我已经习惯这种气候了
E2-TTS + EmoSteer-TTS
裁判全给了满分
CosyVoice2 + EmoSteer-TTS
and vowed he'd change the pigtail's place.
Emotion to Erase: Sadness
Method
Before Erasure (($\beta$=0.0))
Text
After Erasure ($\beta$=2.5)
F5-TTS + EmoSteer-TTS
我一直到清晨四点才到家
E2-TTS + EmoSteer-TTS
比如说华尔街日报就很好
CosyVoice2 + EmoSteer-TTS
Fur flew through the air, teeth gnashed.
Emotion to Erase: Surprise
Method
Before Erasure (($\beta$=0.0))
Text
After Erasure ($\beta$=2.5)
F5-TTS + EmoSteer-TTS
This is Jack, the relatives of Tom?
E2-TTS + EmoSteer-TTS
Tom now let our arrows fly?
CosyVoice2 + EmoSteer-TTS
不是很喜欢你把我绕晕了
Conclusion
We presented EmoSteer-TTS, a training-free framework for fine-grained, continuous, and interpretable
emotion
control in speech synthesis.
By steering a subset of internal activations in a TTS model, our method enables flexible emotional
manipulation, including emotion conversion, interpolation, and erasure, without modifying or fine-tuning
the
pretrained TTS model.
We also constructed a curated emotional speech dataset to support steering vector construction.
Extensive experiments confirm that EmoSteer-TTS achieves robust, zero-shot emotion control with broad
applicability, outperforming SOTA methods.
The analysis also offers deeper insights into the emotion steering dynamics of flow matching-based TTS.
To the best of our knowledge, this is the first fine-grained EC-TTS approach that needs no additional
training.