ReazonSpeech

An open Japanese speech corpus of roughly 35,000 hours, collected from terrestrial TV broadcast streams and paired with aligned Japanese transcriptions.

Recommendation

The best choice for large-scale Japanese ASR pretraining or fine-tuning, thanks to its scale and natural broadcast speech. Use is legally limited to what Article 30-4 of the Japanese Copyright Act permits (text and data mining for research and development), so the terms for commercial deployment are restrictive. The dataset is gated.

Getting the data

Obtain from the dataset homepage.

The Hugging Face dataset is gated: access requires agreeing to use the data solely under Article 30-4 of the Japanese Copyright Act. Available subsets range from 8.5 h to 35,000 h. The creation toolkit is licensed under Apache-2.0.
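As a rough illustration of the download step, the sketch below picks a subset by size and loads it with the Hugging Face `datasets` library. The repository id `reazon-research/reazonspeech`, the subset names, and the intermediate subset sizes are assumptions not stated on this page (only the 8.5 h and 35,000 h endpoints are), so check them against the dataset card. The helper names `pick_subset` and `load_reazonspeech` are hypothetical. Loading only works after accepting the terms on the dataset page and logging in with a Hugging Face token, and some `datasets` versions may need extra arguments.

```python
# Assumed subset names and approximate sizes in hours. Only the 8.5 h and
# 35,000 h endpoints come from this page; verify the rest on the dataset card.
ASSUMED_SUBSET_HOURS = {
    "tiny": 8.5,
    "small": 100.0,
    "medium": 1000.0,
    "large": 3000.0,
    "all": 35000.0,
}


def pick_subset(min_hours: float) -> str:
    """Return the smallest assumed subset holding at least `min_hours` of audio."""
    by_size = sorted(ASSUMED_SUBSET_HOURS.items(), key=lambda item: item[1])
    for name, hours in by_size:
        if hours >= min_hours:
            return name
    raise ValueError(f"no subset provides {min_hours} hours of audio")


def load_reazonspeech(subset: str = "tiny",
                      repo_id: str = "reazon-research/reazonspeech"):
    """Load one subset of the gated dataset (repo id is an assumption).

    Requires prior agreement to the terms on the dataset page and a stored
    Hugging Face token; without them the download is refused.
    """
    # Imported lazily so pick_subset() works without the dependency installed.
    from datasets import load_dataset

    return load_dataset(repo_id, subset, split="train")


if __name__ == "__main__":
    # Start with the smallest subset to validate a pipeline before scaling up.
    print(pick_subset(8.5))
```

Starting from the smallest subset keeps the first end-to-end run cheap; once the pipeline works, the same call can be pointed at a larger subset.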

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.