Datasets | A Collection of Open-Source Speech Corpora

This is a curated list of open speech datasets for speech-related research (mainly for Automatic Speech Recognition). Over 110 speech datasets are collected, and more than 70 datasets can be downloaded directly without further registration or application.
GitHub Link: https://github.com/RevoSpeechTech/speech-datasets-collection
Contributions for more speech datasets are welcome!
You can open an issue in the repository to suggest new speech datasets; the list is updated seasonally.
Notice:
- This repository does not list the license of each dataset. In general, the datasets can be used for research purposes. Before any commercial use, check that the dataset's license permits it.
- Some small-scale speech corpora are omitted for brevity.
Data Overview
Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
---|---|---|---|---|
download directly | supervised | 199k + | 2110 + | 34k + |
download directly | unsupervised | 530k + | 1360 + | 68k + |
download directly | total | 729k + | 3470 + | 102k + |
need application | supervised | 53k + | 16740 + | 50k + |
need application | unsupervised | 60k + | 12400 + | 57k + |
need application | total | 113k + | 29140 + | 107k + |
total | supervised | 252k + | 18850 + | 84k + |
total | unsupervised | 590k + | 13760 + | 125k + |
total | total | 842k + | 32610 + | 209k + |
- Mandarin here includes Mandarin-English code-switching (CS) corpora.
- Sup means a supervised speech corpus with high-quality transcriptions.
- Unsup means an unsupervised or weakly supervised speech corpus.
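The subtotal rows in the overview table are sums of its four base rows. The sketch below recomputes them as a sanity check. The dictionary layout and the `subtotal` helper are illustrative assumptions, not part of the original repository; the figures are copied from the table and are lower bounds (the "+" suffix is dropped).

```python
"""Recompute the subtotal rows of the Data Overview table from its four base rows."""

# (acquisition, supervision) -> hours, as printed in the table.
# "all" and "english" are in thousands of hours; "mandarin" is in hours.
BASE_ROWS = {
    ("download directly", "supervised"): {"all": 199, "mandarin": 2110, "english": 34},
    ("download directly", "unsupervised"): {"all": 530, "mandarin": 1360, "english": 68},
    ("need application", "supervised"): {"all": 53, "mandarin": 16740, "english": 50},
    ("need application", "unsupervised"): {"all": 60, "mandarin": 12400, "english": 57},
}


def subtotal(acquisition=None, supervision=None):
    """Sum the base rows matching the given filters (None matches every row)."""
    total = {"all": 0, "mandarin": 0, "english": 0}
    for (acq, sup), hours in BASE_ROWS.items():
        if acquisition in (None, acq) and supervision in (None, sup):
            for column, value in hours.items():
                total[column] += value
    return total


if __name__ == "__main__":
    print("download directly:", subtotal(acquisition="download directly"))
    print("supervised:", subtotal(supervision="supervised"))
    print("grand total:", subtotal())
```

Each result should agree with the corresponding "total" row of the table; for example, the grand total comes to 842k hours overall, 32610 hours of Mandarin, and 209k hours of English.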
List of ASR corpora
directly downloadable
id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
---|---|---|---|---|---|---|
1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k + |
6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
7 | ST-CMDS | Mandarin | Commands | - | [dataset] | 100 |
8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
11 | aishell-eval | Mandarin | Misc | - | [dataset] | 80 + |
12 | Primewords | Mandarin | Recording | - | [dataset] | 100 |
13 | aidatatang_200zh | Mandarin | Recording | - | [dataset] | 200 |
14 | MagicData | Mandarin | Recording | - | [dataset] | 755 |
15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
16 | Heavy Accent Corpus | Mandarin | Conversational | - | [dataset] | 58 + |
17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
20 | The People’s Speech | English | Misc | [paper] | [dataset] | 30k + |
21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760 + |
22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k) unsup(400k) |
23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | sup(15k) unsup(5k) |
25 | Common Voice (English) | English | Recording | [paper] | [dataset] | sup(2200) unsup(700) |
26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
27 | ai4bharat NPTEL2020 | English (Indian) | Lectures | - | [dataset] | weaksup(15.7k) |
28 | open_stt | Russian | Misc | - | [dataset] | 20k + |
29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10 + |
30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200 + |
31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000 + |
32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000 + |
33 | M-AILABS | Multilingual | Reading | - | [dataset] | 1000 |
34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | sup(100) unsup(1k) |
36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600 +) |
37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
38 | Voxforge | English | Recording | - | [dataset] | 130 |
39 | Tatoeba | English | Recording | - | [dataset] | 200 |
40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k +) |
41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
43 | RuLibrispeech | Russian | Reading | - | [dataset] | 98 |
44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
45 | MUCS 2021 task1 | Multilingual | Misc | - | [dataset] | 300 |
46 | MUCS 2021 task2 | Multilingual | Misc | - | [dataset] | 150 |
47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140 + |
48 | Samromur 21.05 | Icelandic | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150 + |
50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
54 | CLARIN Spoken Corpora | Czech | Recording | - | [dataset] | 1120 + |
55 | Czech Parliament Plenary | Czech | Recording | - | [dataset] | 444 |
56 | (YouTube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k + |
57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56 + |
58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130 + |
59 | Indonesian Unsup | Indonesian | Misc | - | [dataset] | unsup(3000 +) |
60 | Librivox-Spanish | Spanish | Recording | - | [dataset] | 120 |
61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100 + |
63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
65 | NST-Norwegian | Norwegian | Recording | - | [dataset] | 540 |
66 | NST-Danish | Danish | Recording | - | [dataset] | 500 + |
67 | NST-Swedish | Swedish | Recording | - | [dataset] | 300 + |
68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8 + |
70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100 + |
71 | UserLibri | English | Reading | [paper] | [dataset] | - |
72 | Ukrainian Speech | Ukrainian | Misc | - | [dataset] | 1300 + |
73 | UCLA-ASR-corpus | Multilingual | Misc | - | [dataset] | sup(9k) unsup(15k) |
74 | ReazonSpeech | Japanese | Misc | [paper] [code] | [dataset] | 15k |
75 | Bundestag | German | Debate | [paper] | [dataset] | sup(610) unsup(1038) |
need application
id | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
---|---|---|---|---|---|---|
1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k) weaksup(2.4k) unsup(10k) |
3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
4 | aidatatang_1505zh | Mandarin | Recording | - | [dataset] | 1505 |
5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k) unsup(23k) |
7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
8 | AESRC 2020 | English (accented) | Misc | [paper] | [dataset] | 160 |
9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000 + |
10 | TAL_CSASR | Mandarin-English CS | Lectures | - | [dataset] | 587 |
11 | ASRU 2019 ASR | Mandarin-English CS | Reading | - | [dataset] | 700 + |
12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
13 | Fearless Steps | English | Misc | - | [dataset] | unsup(19k) |
14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800 + |
15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
17 | RTVE database | Spanish | TV | [paper] | [dataset] | 800 + |
18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000 + |
20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000 + |
21 | MyST Children’s Speech | English | Recording | - | [dataset] | 393 |
22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20 + |
23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332 + |
24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220 + |
25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470 + |
26 | LRS3-Lang | Multilingual | Audio-Visual | - | [dataset] | 1300 + |
27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000 + |
28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup(3000 +) |
29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200 + |
30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310) unsup(600) |
31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73 + |
32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600) unsup(2000) |
33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
34 | Modality Corpus | Multilingual | Audio-Visual | [paper] | [dataset] | 30 + |
35 | Hindi-Tamil-English | Multilingual | Misc | - | [dataset] | 690 |
36 | English-Vietnamese Corpus | English, Vietnamese | Misc | [paper] | [dataset] | 500 + |
37 | OLKAVS | Korean | Audio-Visual | [paper] [code] | [dataset] | 1150 |
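The Size (Hours) column in both tables mixes several notations: plain numbers (`960`), a `k` suffix for thousands (`50k +`), labeled parts (`sup(1.8k) unsup(400k)`), and placeholders (`TBC`, `-`). To filter or sum the list programmatically, a small parser along the following lines might help. This is a minimal sketch: the `parse_size` function and the `"unlabeled"` key are hypothetical names introduced here, and the trailing "+" (meaning "at least") is simply dropped.

```python
import re

# An optional label with an opening parenthesis, then a number with an optional "k".
SIZE_PATTERN = re.compile(r"(?:(weaksup|unsup|sup)\s*\(\s*)?(\d+(?:\.\d+)?)\s*(k?)")


def parse_size(cell):
    """Return {label: hours} for one Size (Hours) cell, or {} if it holds no number.

    Parts without a sup/weaksup/unsup label are stored under "unlabeled".
    """
    hours = {}
    for match in SIZE_PATTERN.finditer(cell):
        label, number, kilo = match.groups()
        scale = 1000 if kilo else 1
        key = label or "unlabeled"
        # round() turns values such as 1.8 * 1000 into whole hours.
        hours[key] = hours.get(key, 0) + round(float(number) * scale)
    return hours


if __name__ == "__main__":
    for cell in ["960", "50k +", "sup(1.8k) unsup(400k)", "unsup (3000+)", "TBC"]:
        print(repr(cell), "->", parse_size(cell))
```

The parsed values are lower bounds wherever the original cell carries a "+", so any sums built from them should be read the same way.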
References
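- Post title: Datasets | A Collection of Open-Source Speech Corpora
- Created: 2023-01-07
- Permalink: 2023/01/07/speech-datasets-collection/
- Copyright notice: Unless otherwise stated, all posts on this blog are licensed under BY-NC-SA. Please credit the source when reposting!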