Hello, thank you for your excellent work.
I would like to reproduce your work on a new model. The download link for openwebtext you provided is no longer valid.
As you mentioned in other issues, MiniLLM provides processed data, but it only provides data tokenized with the GPT-2 tokenizer, which cannot be used to train models from other families. Could you provide the original, untokenized training corpus?