Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its superior ubiquity and performance. However, inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition makes existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human intervention. At its core is a two-stage framework that establishes correspondence between visual and ultrasonic modalities by leveraging audible audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show that USpeech achieves remarkable performance: with synthetic ultrasound data it performs comparably to physical ultrasound data, and it significantly outperforms state-of-the-art ultrasound-based speech enhancement baselines.
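The contrastive video-audio pre-training described above is commonly implemented with a symmetric InfoNCE objective, where paired video/audio clips in a batch are positives and all other pairings are negatives. The sketch below is illustrative only, not the paper's actual implementation; the function name and embedding shapes are assumptions.

```python
import numpy as np

def info_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/audio embeddings.

    Row i of video_emb and row i of audio_emb form a positive pair;
    every other row in the batch serves as a negative.
    (Hypothetical sketch; not USpeech's actual training code.)
    """
    # L2-normalize so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(v))            # positive pairs lie on the diagonal

    def cross_entropy(logits, labels):
        # numerically stable log-softmax cross-entropy
        z = logits - logits.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # symmetric: video-to-audio retrieval and audio-to-video retrieval
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls matched video/audio embeddings together in the shared semantic space while pushing mismatched pairs apart, which is what lets audible audio act as the bridge to the ultrasonic modality in the second stage.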
Section 2: Large-scale Dataset Synthesis Enhancement
In this section, we demonstrate the recovery results of large-scale dataset synthesis without any training, on LJSpeech, TIMIT, and VCTK (see Section 7.2 in the paper).
(1) LJSpeech
(2) TIMIT
(3) VCTK
Section 3: Long-duration Enhancement
In this section, we demonstrate the recovery results of the experiments on long-duration enhancement (see Section 7.3 in the paper).
(1) Collected Dataset
(2) LJSpeech
(3) TIMIT
(4) VCTK
Section 4: Different Noise Interference
In this section, we demonstrate the recovery results of the experiments on different noise interference, including different environmental interference, competing speaker interference, and human voice interference (see Section 7.5 in the paper).
(1) Different Environmental Interference
(2) Competing Speakers Interference
(3) Human Voice Interference