Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching

Anonymous authors

Abstract

Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into that of an arbitrary unseen speaker while retaining the speech content. Most prior work focuses on preserving the source's prosody, yet fine-grained timbre information can leak through prosody, and transferring the target's prosody to the synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes the source speech into HuBERT content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the linguistic content duration to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) trained with shortcut flow matching, conditioning the network not only on the current noise level but also on the desired step size. This enables high timbre similarity and high-quality speech generation in far fewer sampling steps, as few as two, thus minimizing latency. Experimental results show that R-VC achieves speaker similarity comparable to state-of-the-art VC methods with a smaller dataset, and surpasses them in speech naturalness, intelligibility, and style transfer performance.
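The key inference-time idea in shortcut flow matching is that the velocity network is conditioned on the step size as well as the time step, so a few large Euler steps can approximate a many-step trajectory. The sketch below is a minimal illustration of that sampling loop, not the paper's implementation: `v_net` stands in for the DiT, and the toy straight-line flow is an assumption chosen so the loop's behavior is easy to check.

```python
def shortcut_sample(v_net, x0, nfe):
    """Euler sampler for shortcut flow matching.

    v_net(x, t, d) is conditioned on both the current time t and the
    step size d, so it can predict the average velocity over a large
    step rather than only the instantaneous one.
    """
    x = x0
    d = 1.0 / nfe  # NFE sampling steps of equal size
    for i in range(nfe):
        x = x + d * v_net(x, i * d, d)
    return x

# Toy straight-line flow from x0 toward a target mu: the true average
# velocity over any step is (mu - x0), independent of t and d, so
# NFE=2 and NFE=32 reach the same endpoint.
x0, mu = 0.25, 1.0
v_net = lambda x, t, d: mu - x0
print(shortcut_sample(v_net, x0, nfe=2))   # → 1.0
print(shortcut_sample(v_net, x0, nfe=32))  # → 1.0
```

In the real model the flow is not straight, which is why conditioning on `d` matters: the network learns step-size-dependent average velocities so that low-NFE sampling (e.g. NFE=2) stays close to the high-NFE trajectory.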

Model Overview


The overall architecture of R-VC.

Zero-shot Voice Conversion Results

source target FACodec-VC SEF-VC HierSpeech++ CosyVoice-VC Diff-HierVC R-VC (CFM, NFE=10) R-VC (ours, NFE=2)

Emotion Style Transfer Exp1

Note: source speech from test-clean dataset, target speech from ESD dataset

source target FACodec-VC SEF-VC HierSpeech++ CosyVoice-VC Diff-HierVC R-VC (CFM, NFE=10) R-VC (ours, NFE=2)

Emotion Style Transfer Exp2

Note: both source and target speech from ESD dataset

source target FACodec-VC SEF-VC HierSpeech++ CosyVoice-VC Diff-HierVC R-VC (CFM, NFE=10) R-VC (ours, NFE=2)

CFM and Shortcut CFM methods under various sampling steps

Model source target NFE=1 NFE=2 NFE=4 NFE=8 NFE=16 NFE=32
CFM
Shortcut CFM

Rhythm Control Demo (Rebuttal)

The speaking rate of the target speech is categorized into three levels: low, normal, and high. Across all three levels, the speech synthesized by R-VC closely matches the speaking rate of the target speech.
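Such three-way leveling can be done by bucketing a speaking-rate measure, e.g. syllables per second. The helper below is a hedged sketch of this idea; the threshold values are illustrative assumptions, not the ones used in the paper.

```python
def rate_level(n_syllables, duration_s, lo=3.5, hi=5.5):
    """Bucket a speaking rate (syllables/second) into low/normal/high.

    lo and hi are illustrative thresholds (syll/s), not the paper's.
    """
    rate = n_syllables / duration_s
    if rate < lo:
        return "low"
    if rate > hi:
        return "high"
    return "normal"

print(rate_level(10, 4.0))  # 2.5 syll/s → "low"
print(rate_level(30, 4.0))  # 7.5 syll/s → "high"
```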

Level source target FACodec-VC SEF-VC HierSpeech++ CosyVoice-VC Diff-HierVC R-VC (ours)
Low
Normal
High