uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

Muqiao Yang (CMU), Chunlei Zhang (Tencent), Yong Xu (Tencent), Zhongweiyang Xu (UIUC), Heming Wang (OSU), Bhiksha Raj (CMU), Dong Yu (Tencent)

TL;DR: We build a Unified Speech Enhancement and Editing (uSee) framework with conditional diffusion models to enable fine-grained controllable generation based on both acoustic and textual prompts.

Abstract

In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model based on conditional diffusion models that handles multiple tasks simultaneously in a generative manner. Specifically, by providing multiple types of conditions, including self-supervised learning (SSL) embeddings and appropriate text prompts, to the score-based diffusion model, we enable controllable generation, so that the unified speech enhancement and editing model performs the corresponding action on the source speech. Our experiments show that the proposed uSee model achieves superior performance in both speech denoising and dereverberation compared to related generative speech enhancement models, and that it can perform speech editing given a desired environmental sound text description, signal-to-noise ratio (SNR), and room impulse response (RIR).
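
As a rough illustration of the conditioning idea described above (not the uSee implementation itself), the sketch below shows one way a score-based diffusion model could take both an acoustic prompt (a pooled SSL embedding of the source speech) and a text-prompt embedding as conditions. The architecture, embedding dimensions, and geometric noise schedule are all assumptions made for this example.

import torch
import torch.nn as nn


class ConditionalScoreNet(nn.Module):
    """Predicts the score of a noisy mel-spectrogram given the diffusion time,
    an acoustic (SSL) condition, and a text-prompt condition."""

    def __init__(self, n_mels=80, ssl_dim=768, text_dim=512, hidden=256):
        super().__init__()
        self.in_proj = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        # Project the timestep and both conditions into the hidden space
        # and add them as a global bias over all time frames.
        self.t_proj = nn.Linear(1, hidden)
        self.ssl_proj = nn.Linear(ssl_dim, hidden)    # acoustic prompt
        self.txt_proj = nn.Linear(text_dim, hidden)   # text prompt
        self.backbone = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.GELU(),
        )
        self.out_proj = nn.Conv1d(hidden, n_mels, kernel_size=3, padding=1)

    def forward(self, x_t, t, ssl_emb, txt_emb):
        # x_t: (B, n_mels, T) noisy spectrogram at diffusion time t in [0, 1]
        h = self.in_proj(x_t)
        cond = (self.t_proj(t.unsqueeze(-1))
                + self.ssl_proj(ssl_emb)
                + self.txt_proj(txt_emb))            # (B, hidden)
        h = h + cond.unsqueeze(-1)                   # broadcast over frames
        return self.out_proj(self.backbone(h))       # estimated score, (B, n_mels, T)


def denoising_score_matching_loss(model, x0, ssl_emb, txt_emb,
                                  sigma_min=0.01, sigma_max=1.0):
    """One training step of denoising score matching with a simple
    geometric noise schedule (an assumption made for this sketch)."""
    b = x0.size(0)
    t = torch.rand(b, device=x0.device)                  # uniform diffusion time
    sigma = sigma_min * (sigma_max / sigma_min) ** t      # per-sample noise level
    noise = torch.randn_like(x0) * sigma.view(b, 1, 1)
    x_t = x0 + noise
    score = model(x_t, t, ssl_emb, txt_emb)
    target = -noise / (sigma.view(b, 1, 1) ** 2)           # score of the Gaussian perturbation
    return ((score - target) ** 2 * sigma.view(b, 1, 1) ** 2).mean()


if __name__ == "__main__":
    model = ConditionalScoreNet()
    x0 = torch.randn(2, 80, 100)     # clean mel-spectrograms (toy data)
    ssl = torch.randn(2, 768)        # e.g. pooled SSL embedding of the source speech
    txt = torch.randn(2, 512)        # e.g. embedding of a "remove the noise" prompt
    loss = denoising_score_matching_loss(model, x0, ssl, txt)
    loss.backward()
    print(float(loss))

At sampling time, the same network would sit inside a reverse-SDE or annealed Langevin sampler, with the acoustic and text conditions held fixed while the spectrogram is iteratively denoised.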

Demo: Speech Enhancement

Overall Demo:
Source → Target (Denoised)
Source → Target (Dereverbed)
Source → Target

Ablation Experiments:
Source / Without Acoustic Prompts / With Acoustic Prompts

Demo: Speech Editing

Overall Demo:
Example sound → Target (Add bird chirp)
Example sound → Target (Add hard rock)
Example sound → Target (Add happy hour)

Controllable Generation Effect:
Raw audio / Add small room RIR / Add medium room RIR / Add large room RIR
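
The room-size variations above are driven by the conditioning prompts rather than by post-processing. Purely as a hypothetical illustration of how such prompts might be composed (the wording and the helper below are assumptions; only the controllable attributes, i.e. the sound event, SNR, and room size/RIR, come from the paper's description), a small helper could look like:

from typing import Optional

def build_edit_prompt(sound_event: str, snr_db: Optional[float] = None,
                      room_size: Optional[str] = None) -> str:
    """Compose a text prompt describing the desired edit (hypothetical format)."""
    parts = [f"add {sound_event}"]
    if snr_db is not None:
        parts.append(f"at {snr_db:g} dB SNR")
    if room_size is not None:
        parts.append(f"in a {room_size} room")
    return " ".join(parts)

# e.g. prompts matching the demos above:
for room in ("small", "medium", "large"):
    print(build_edit_prompt("bird chirp", snr_db=5, room_size=room))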

BibTeX

@article{yang2023usee,
  title={uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models},
  author={Yang, Muqiao and Zhang, Chunlei and Xu, Yong and Xu, Zhongweiyang and Wang, Heming and Raj, Bhiksha and Yu, Dong},
  journal={arXiv preprint arXiv:2310.00900},
  year={2023}
}