Datasets | A Collection of Open-Source Speech Datasets
RevoSpeech

This is a curated list of open speech datasets for speech-related research, mainly Automatic Speech Recognition (ASR). It covers over 110 speech datasets, more than 70 of which can be downloaded directly without registration or application.

GitHub Link: https://github.com/RevoSpeechTech/speech-datasets-collection
Contributions of additional speech datasets are welcome!
To suggest a new speech dataset, open an issue in the GitHub repository. The list is updated seasonally.

Notice:

  1. This repository does not list the license of each dataset. In general, these datasets may be used for research purposes only. Please check that the license permits it before using a dataset commercially.
  2. Some small-scale speech corpora are omitted for brevity.

Data Overview

| Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
|---|---|---|---|---|
| download directly | supervised | 199k + | 2110 + | 34k + |
| download directly | unsupervised | 530k + | 1360 + | 68k + |
| download directly | total | 729k + | 3470 + | 102k + |
| need application | supervised | 53k + | 16740 + | 50k + |
| need application | unsupervised | 60k + | 12400 + | 57k + |
| need application | total | 113k + | 29140 + | 107k + |
| total | supervised | 252k + | 18850 + | 84k + |
| total | unsupervised | 590k + | 13760 + | 125k + |
| total | total | 842k + | 32610 + | 209k + |

  • Mandarin here includes Mandarin-English code-switching (CS) corpora.
  • Sup denotes a supervised speech corpus with high-quality transcriptions.
  • Unsup denotes an unsupervised or weakly supervised speech corpus.
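
The totals in the overview table can be cross-checked with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the names `overview`, `add`, and `check_overview` are assumptions introduced here and are not part of the repository. The figures are copied from the table with the "+" suffixes dropped (All Languages and English in thousands of hours, Mandarin in hours).

```python
# Values copied from the overview table, "+" suffixes dropped.
# Tuple order: (all languages [k hours], Mandarin [hours], English [k hours]).
overview = {
    ("download directly", "supervised"):   (199, 2110, 34),
    ("download directly", "unsupervised"): (530, 1360, 68),
    ("download directly", "total"):        (729, 3470, 102),
    ("need application", "supervised"):    (53, 16740, 50),
    ("need application", "unsupervised"):  (60, 12400, 57),
    ("need application", "total"):         (113, 29140, 107),
    ("total", "supervised"):               (252, 18850, 84),
    ("total", "unsupervised"):             (590, 13760, 125),
    ("total", "total"):                    (842, 32610, 209),
}


def add(a, b):
    """Element-wise sum of two equally long tuples."""
    return tuple(x + y for x, y in zip(a, b))


def check_overview(table):
    """Return the (acquisition, kind) cells that disagree with the sum of their parts."""
    mismatches = []
    # Within each acquisition group: supervised + unsupervised should equal total.
    for acq in ("download directly", "need application", "total"):
        parts = add(table[(acq, "supervised")], table[(acq, "unsupervised")])
        if parts != table[(acq, "total")]:
            mismatches.append((acq, "total"))
    # Across groups: download directly + need application should equal total.
    for kind in ("supervised", "unsupervised", "total"):
        parts = add(table[("download directly", kind)], table[("need application", kind)])
        if parts != table[("total", kind)]:
            mismatches.append(("total", kind))
    return mismatches


print(check_overview(overview))  # prints [] because every row and column sum agrees
```

Since every figure in the table is a rounded lower bound, this check only confirms that the rounded values are internally consistent, not that they match the exact corpus sizes.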

List of ASR corpora

directly downloadable

| id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
|---|---|---|---|---|---|---|
| 1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
| 2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
| 3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
| 4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
| 5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k + |
| 6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
| 7 | ST-CMDS | Mandarin | Commands | - | [dataset] | 100 |
| 8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
| 9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
| 10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 11 | aishell-eval | Mandarin | Misc | - | [dataset] | 80 + |
| 12 | Primewords | Mandarin | Recording | - | [dataset] | 100 |
| 13 | aidatatang_200zh | Mandarin | Record | - | [dataset] | 200 |
| 14 | MagicData | Mandarin | Recording | - | [dataset] | 755 |
| 15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
| 16 | Heavy Accent Corpus | Mandarin | Conversational | - | [dataset] | 58 + |
| 17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
| 19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
| 20 | The People’s Speech | English | Misc | [paper] | [dataset] | 30k + |
| 21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760 + |
| 22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k), unsup(400k) |
| 23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
| 24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | sup(15k), unsup(5k) |
| 25 | Common Voice (English) | English | Recording | [paper] | [dataset] | sup(2200), unsup(700) |
| 26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
| 27 | ai4bharat NPTEL2020 | English (Indian) | Lectures | - | [dataset] | weaksup(15.7k) |
| 28 | open_stt | Russian | Misc | - | [dataset] | 20k + |
| 29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10 + |
| 30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200 + |
| 31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000 + |
| 32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000 + |
| 33 | M-AILABS | Multilingual | Reading | - | [dataset] | 1000 |
| 34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
| 35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | sup(100), unsup(1k) |
| 36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600 +) |
| 37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
| 38 | Voxforge | English | Recording | - | [dataset] | 130 |
| 39 | Tatoeba | English | Recording | - | [dataset] | 200 |
| 40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k +) |
| 41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
| 42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
| 43 | RuLibrispeech | Russian | Read | - | [dataset] | 98 |
| 44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
| 45 | MUCS 2021 task1 | Multilingual | Misc | - | [dataset] | 300 |
| 46 | MUCS 2021 task2 | Multilingual | Misc | - | [dataset] | 150 |
| 47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140 + |
| 48 | Samromur 21.05 | Samromur | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
| 49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150 + |
| 50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
| 51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
| 52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
| 53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
| 54 | CLARIN Spoken Corpora | Czech | Recording | - | [dataset] | 1120 + |
| 55 | Czech Parliament Plenary | Czech | Recording | - | [dataset] | 444 |
| 56 | (Youtube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k + |
| 57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56 + |
| 58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130 + |
| 59 | Indonesian Unsup | Indonesian | Misc | - | [dataset] | unsup (3000+) |
| 60 | Librivox-Spanish | Spanish | Recording | - | [dataset] | 120 |
| 61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
| 62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100 + |
| 63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
| 64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
| 65 | NST-Norwegian | Norwegian | Recording | - | [dataset] | 540 |
| 66 | NST-Danish | Danish | Recording | - | [dataset] | 500 + |
| 67 | NST-Swedish | Swedish | Recording | - | [dataset] | 300 + |
| 68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
| 69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8 + |
| 70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100 + |
| 71 | UserLibri | English | Reading | [paper] | [dataset] | - |
| 72 | Ukrainian Speech | Ukrainian | Misc | - | [dataset] | 1300+ |
| 73 | UCLA-ASR-corpus | Multilingual | Misc | - | [dataset] | unsup(15k), sup(9k) |
| 74 | ReazonSpeech | Japanese | Misc | [paper] [code] | [dataset] | 15k |
| 75 | Bundestag | German | Debate | [paper] | [dataset] | sup(610), unsup(1038) |

need application

| id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
|---|---|---|---|---|---|---|
| 1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
| 2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k), weaksup(2.4k), unsup(10k) |
| 3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
| 4 | aidatatang_1505zh | Mandarin | Recording | - | [dataset] | 1505 |
| 5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
| 6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k), unsup(23k) |
| 7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
| 8 | AESRC 2020 | English (accented) | Misc | [paper] | [dataset] | 160 |
| 9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000 + |
| 10 | TAL_CSASR | Mandarin-English CS | Lectures | - | [dataset] | 587 |
| 11 | ASRU 2019 ASR | Mandarin-English CS | Reading | - | [dataset] | 700 + |
| 12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
| 13 | Fearless Steps | English | Misc | - | [dataset] | unsup(19k) |
| 14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800 + |
| 15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
| 16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
| 17 | RVTE database | Spanish | TV | [paper] | [dataset] | 800 + |
| 18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
| 19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000 + |
| 20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000 + |
| 21 | MyST Children’s Speech | English | Recording | - | [dataset] | 393 |
| 22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20 + |
| 23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332 + |
| 24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220 + |
| 25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470 + |
| 26 | LRS3-Lang | Multilingual | Audio-Visual | - | [dataset] | 1300 + |
| 27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000 + |
| 28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup (3000 +) |
| 29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200 + |
| 30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310), unsup(600) |
| 31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73 + |
| 32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600), unsup(2000) |
| 33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
| 34 | Modality Corpus | Multilingual | Audio-Visual | [paper] | [dataset] | 30 + |
| 35 | Hindi-Tamil-English | Multilingual | Misc | - | [dataset] | 690 |
| 36 | English-Vietnamese Corpus | English, Vietnamese | Misc | [paper] | [dataset] | 500+ |
| 37 | OLKAVS | Korean | Audio-Visual | [paper] [code] | [dataset] | 1150 |
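
The Size (Hours) cells in both lists mix several notations: plain numbers (`960`), a `k` multiplier (`50k +`), a trailing `+` for lower bounds, and `sup(...)`, `weaksup(...)`, `unsup(...)` labels. For anyone who wants to aggregate the figures, the following is a minimal parsing sketch. The function name `parse_size`, the regular expression, and the `"unspecified"` key for unlabelled sizes are all assumptions made for illustration; they do not come from the repository.

```python
import re

# One labelled or unlabelled quantity, e.g. "sup(1.8k)", "unsup (3000+)", "50k +", "960".
_PART = re.compile(
    r"(?:(?P<label>weaksup|unsup|sup)\s*\(\s*)?"  # optional label and opening parenthesis
    r"(?P<num>\d+(?:\.\d+)?)\s*(?P<kilo>k)?"      # number with optional "k" (thousands)
    r"\s*(?P<plus>\+)?\s*\)?"                     # optional "+" (lower bound) and ")"
)


def parse_size(cell):
    """Parse a Size (Hours) cell into {label: (hours, is_lower_bound)}.

    Unlabelled sizes are stored under the hypothetical key "unspecified".
    Cells without a number (such as "TBC" or "-") yield an empty dict.
    """
    sizes = {}
    for match in _PART.finditer(cell):
        hours = float(match.group("num")) * (1000 if match.group("kilo") else 1)
        label = match.group("label") or "unspecified"
        sizes[label] = (hours, match.group("plus") is not None)
    return sizes


print(parse_size("50k +"))                 # {'unspecified': (50000.0, True)}
print(parse_size("sup(10k), unsup(23k)"))  # {'sup': (10000.0, False), 'unsup': (23000.0, False)}
print(parse_size("TBC"))                   # {}
```

The sketch keeps the lower-bound flag separate from the number so that sums over many corpora can be reported as "at least N hours" rather than as exact totals.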

References