
USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-modal Synthesis

Luca Jiang-Tao Yu Running Zhao Sijie Ji Edith C.H. Ngai Chenshu Wu
The University of Hong Kong  
📄 arXiv        🔗 code

Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its superior ubiquity and performance. However, inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition makes existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human intervention. At its core is a two-stage framework that establishes correspondence between visual and ultrasonic modalities by leveraging audible audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show USpeech achieves remarkable performance using synthetic ultrasound data comparable to physical data, significantly outperforming state-of-the-art ultrasound-based speech enhancement baselines.
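The first stage of the framework aligns video and audio encoders in a shared semantic space through contrastive pre-training. As a rough illustration of that alignment step only, the following minimal PyTorch sketch computes a symmetric InfoNCE-style contrastive loss over a batch of paired video/audio embeddings; the embedding dimension, temperature, and the random tensors standing in for real encoder outputs are illustrative assumptions, not the actual USpeech implementation.

import torch
import torch.nn.functional as F

def contrastive_va_loss(video_emb: torch.Tensor,
                        audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over paired video/audio embeddings.

    video_emb, audio_emb: (batch, dim) outputs of the two modality encoders.
    Matching (video_i, audio_i) pairs are positives; all other pairs in the
    batch serve as negatives.
    """
    # Normalize both modalities so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares video i to audio j.
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Cross-entropy in both directions (video-to-audio and audio-to-video).
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

# Toy usage with random embeddings as placeholders for encoder outputs.
if __name__ == "__main__":
    video = torch.randn(8, 256)   # hypothetical video encoder output
    audio = torch.randn(8, 256)   # hypothetical audio encoder output
    print(contrastive_va_loss(video, audio).item())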

Video
Results

Section 1: Main Experiments
In this section, we demonstrate the recovery results of USpeech and the baselines. (See Section 7.1 in the paper)

Clean speech

Noisy speech

UltraSE

Ultraspeech

PHASEN

USpeech w/ syn. (ours)

USpeech w/ phy. (ours)

Section 2: Large-scale Dataset Synthesis and Enhancement
In this section, we demonstrate the recovery results of large-scale dataset synthesis without any training, including LJSpeech, TIMIT, and VCTK. (See Section 7.2 in the paper)
(1) LJSpeech

Clean speech

Noisy speech

USpeech w/ syn. (ours)

(2) TIMIT

Clean speech

Noisy speech

USpeech w/ syn. (ours)

(3) VCTK

Clean speech

Noisy speech

USpeech w/ syn. (ours)

Section 3: Long-duration Enhancement
In this section, we demonstrate the recovery results of the experiments on long-duration enhancement. (See Section 7.3 in the paper)
(1) Collected Dataset

Clean speech

Noisy speech

USpeech w/ syn. (ours)

USpeech w/ phy. (ours)

(2) LJSpeech

Clean speech

Noisy speech

USpeech w/ syn. (ours)

(3) TIMIT

Clean speech

Noisy speech

USpeech w/ syn. (ours)

(4) VCTK

Clean speech

Noisy speech

USpeech w/ syn. (ours)

Section 4: Different Noise Interference
In this section, we demonstrate the recovery results of the experiments on different noise interference, including different environmental interference, competing speakers interference, and human voice interference. (See Section 7.5 in the paper)
(1) Different Environmental Interference

Clean speech

Noisy speech

USpeech w/ syn. (ours)

USpeech w/ phy. (ours)

(2) Competing Speakers Interference

Clean speech

Noisy speech

USpeech w/ syn. (ours)

USpeech w/ phy. (ours)

(3) Human Voice Interference

Clean speech

Noisy speech

USpeech w/ syn. (ours)

USpeech w/ phy. (ours)

Thanks to Despoina Paschalidou for the website template.