GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech

Online Access

Please access the dataset from Hugging Face or OpenSLR (coming soon).

Please consider citing our work if you find this dataset useful.


@misc{wang2024globe,
  title={GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech},
  author={Wenbin Wang and Yang Song and Sanjay Jha},
  year={2024},
  eprint={2406.14875},
  archivePrefix={arXiv},
}

Contents

1. Abstract

This paper introduces GLOBE, a high-quality English corpus with worldwide accents, specifically designed to address the limitations of current zero-shot speaker adaptive Text-to-Speech (TTS) systems, which generalize poorly when adapting to speakers with accents. Compared to commonly used English corpora, such as LibriTTS and VCTK, GLOBE is unique in that it includes utterances from 23,519 speakers and covers 164 accents worldwide, along with detailed metadata for these speakers. Compared to its source corpus, Common Voice, GLOBE significantly improves the quality of the speech data through rigorous filtering and enhancement processes, while also populating all missing speaker metadata. The final curated GLOBE corpus includes 535 hours of speech data at a 24 kHz sampling rate. Our benchmark results indicate that a speaker adaptive TTS model trained on the GLOBE corpus can synthesize speech with better speaker similarity than, and naturalness comparable to, models trained on other popular corpora. We will release GLOBE publicly after acceptance.
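To give a rough sense of scale, the figures quoted in the abstract can be turned into simple averages. The sketch below is only back-of-the-envelope arithmetic: the function name `corpus_scale` is hypothetical, the sample count assumes single-channel audio, and the averages say nothing about how speech is actually distributed across speakers or accents, which is likely uneven.

```python
# Figures quoted in the abstract.
TOTAL_HOURS = 535
NUM_SPEAKERS = 23_519
NUM_ACCENTS = 164
SAMPLE_RATE_HZ = 24_000


def corpus_scale(total_hours, num_speakers, num_accents, sample_rate_hz):
    """Derive simple averages from corpus-level totals (hypothetical helper)."""
    total_seconds = total_hours * 3600
    return {
        # Mean amount of speech per speaker, in seconds.
        "seconds_per_speaker": total_seconds / num_speakers,
        # Mean number of speakers per accent label.
        "speakers_per_accent": num_speakers / num_accents,
        # Total audio samples, assuming mono audio (an assumption).
        "total_samples": total_seconds * sample_rate_hz,
    }


scale = corpus_scale(TOTAL_HOURS, NUM_SPEAKERS, NUM_ACCENTS, SAMPLE_RATE_HZ)
print(f"Average speech per speaker: {scale['seconds_per_speaker']:.1f} s")
print(f"Average speakers per accent: {scale['speakers_per_accent']:.1f}")
print(f"Total samples at 24 kHz (mono assumed): {scale['total_samples']:,}")
```

On these numbers, each speaker contributes on the order of a minute or two of speech on average, which fits the zero-shot setting, where a model adapts to a speaker from a short reference recording.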



2. Ground-Truth Speech Samples

2.1 Speaker map

Please check out the ground-truth (GT) speech samples of VCTK, LibriTTS, and GLOBE via the following interactive map:

2.2 Speech samples from speakers of different ages (all speakers in GLOBE have an age label)

Age Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
teens
twenties
thirties
forties
fifties
sixties
seventies

2.3 Speech samples from speakers of different accents (all speakers in GLOBE have an accent label)

Accent Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
New Zealand English
Irish English
England English
Canadian English
India and South Asia (India, Pakistan, Sri Lanka)
United States English
Northern Irish
Scottish English
Australian English
Filipino
South German accent
Nigerian English
West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)
Ontario English
Malaysian English
Southern African (South Africa, Zimbabwe, Namibia)
French English
Northumbrian British English
Hong Kong English
Liverpool English
Singaporean English
Welsh English

3. Synthesized Speech Samples



Evaluated on the LibriTTS test set:
Text Ground-Truth VCTK LibriTTS GLOBE
1. No my lord said Robert taken aback by the disappearance of his friend Pagano.
2. Whose name did you sign to the check? asked Kenneth.
3. The struggle continues with unabated ferocity.
4. I have been dead for nearly as many years.
5. His method was a winner.
6. When Hilda rose, he sat down on the arm of her chair and drew her back into it.
7. She turned quickly and came back two steps.
8. the relative strength of bodies of troops can never be known to anyone.
9. He put his gloves on the chair and he took the proof sheet by sheet to copy them.
10. The long gray slopes leading up to the glacier seem remarkably smooth and unbroken.


Evaluated on the GLOBE test set:
Text Ground-Truth VCTK LibriTTS GLOBE
1. In writing Jackie, Davies incorporated both comic and tragic elements.
2. There were substantial efforts to translate Greek texts into Syriac.
3. Arkhangelskoye is the nearest rural locality.
4. Its county seat is Vero Beach.
5. They are also used to transport sound equipment.
6. It's good to see you.
7. Lima is the capital of Peru.
8. Panasonic also sold the first bread machine.
9. He grew up into a musical family.
10. His executive powers are somewhat limited, though he is able to veto legislation.

4. Conclusion

This paper introduces GLOBE, a high-quality English corpus featuring worldwide accents originating from Common Voice, aimed at addressing the poor generalizability issue of current zero-shot speaker-adaptive TTS models. GLOBE not only matches the audio quality of popular TTS datasets like LibriTTS but also surpasses them by covering a broader range of worldwide accents and offering metadata for an extensive array of over 20,000 speakers. Our experiments demonstrate that speaker-adaptive TTS models trained on GLOBE achieve better generalizability than those trained on other datasets. We hope that the release of GLOBE will contribute to advancements in TTS research.