Training data for the VAE

The technical report mentions that an audio VAE was trained on over 1 million music-related samples. However, in my testing, I found that this VAE also performs quite well on speech. This makes me curious: when training an audio model, is audio quality more important than quantity? Is a dataset of 1 million audio samples sufficient?