ReazonSpeech
~35,000 h open Japanese speech corpus collected from terrestrial TV broadcast streams with aligned Japanese transcriptions.
- Task: asr
- Languages: ja
- Hours: 35000
- Domain: tv/broadcast
- License: CDLA-Sharing-1.0
- Homepage: https://research.reazon.jp/projects/ReazonSpeech/
- Paper: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
Recommendation
The best choice for large-scale Japanese ASR pretraining or fine-tuning, given its scale and natural broadcast speech. Use is limited to the purposes covered by Article 30-4 of the Japanese Copyright Act (text and data mining for R&D), so the terms for commercial deployment are restrictive. Access to the dataset is gated.
Getting the data
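Obtain the data from the dataset homepage.
The Hugging Face dataset is gated: access requires agreeing to use the data solely under Article 30-4 of the Japanese Copyright Act. Available subsets range from 8.5 h to 35,000 h. The corpus creation toolkit is licensed separately under Apache-2.0.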
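As a rough illustration of loading a gated subset once the agreement has been accepted, the sketch below uses the Hugging Face `datasets` library. The repository id, the subset name, and the split name are assumptions not stated in this card; confirm them on the dataset homepage. Depending on the installed `datasets` version, additional arguments may be needed, which is why `peek` accepts overrides.

```python
import os

# Assumptions (not stated in this card): the repository id, the subset name,
# and the split name. Check the dataset homepage before relying on them.
REPO_ID = "reazon-research/reazonspeech"  # assumed Hugging Face repo id
SUBSET = "tiny"                           # assumed name of the smallest (8.5 h) subset


def load_kwargs(subset=SUBSET, token=None):
    """Build keyword arguments for datasets.load_dataset."""
    return {
        "path": REPO_ID,
        "name": subset,
        "split": "train",  # assumed split name
        # Gated datasets need an access token from an account that has
        # accepted the usage agreement; fall back to the HF_TOKEN variable.
        "token": token or os.environ.get("HF_TOKEN"),
    }


def peek(n=3, **overrides):
    """Load the subset and return its first n examples (needs network access)."""
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(**{**load_kwargs(), **overrides})
    return [ds[i] for i in range(n)]
```

Starting with the smallest subset keeps the first download short; switch the subset name to a larger one only after the pipeline works end to end.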
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.