- [ ] Apply DataTrove Data Segmenters for Language-Specific Word Tokenization
- [ ] Create a list of candidate datasets for fertility, parity, and PCW analysis
- [ ] Add standalone functions for metric evaluation and visualization for the paper