Implement the Tokenizer Metrics Script #21

@Malikeh97

Description

  • Apply DataTrove data segmenters for language-specific word tokenization

  • Create a list of candidate datasets for the fertility, parity, and PCW analyses

  • Add standalone functions for metric evaluation and visualization for the paper
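
As a starting point for the standalone metric functions, here is a minimal sketch rather than the final script. It makes several assumptions not stated in this issue: fertility is taken as the average number of subword tokens per word, PCW is read as the proportion of words split into two or more tokens, and parity is taken as the ratio of total token counts over aligned parallel sentences. The names `fertility`, `pcw`, `parity`, and `toy_tokenize` are hypothetical, and plain whitespace splitting stands in for the DataTrove word segmenters.

```python
from typing import Callable, Iterable, List

# Hypothetical callable types: a real tokenizer and a DataTrove word
# segmenter would be wrapped to match these signatures.
Tokenize = Callable[[str], List[str]]
SplitWords = Callable[[str], List[str]]


def fertility(texts: Iterable[str], tokenize: Tokenize,
              split_words: SplitWords = str.split) -> float:
    """Average number of subword tokens per word (assumed definition)."""
    n_words = n_tokens = 0
    for text in texts:
        for word in split_words(text):
            n_words += 1
            # Assumption: each word is tokenized in isolation.
            n_tokens += len(tokenize(word))
    return n_tokens / n_words if n_words else 0.0


def pcw(texts: Iterable[str], tokenize: Tokenize,
        split_words: SplitWords = str.split) -> float:
    """Share of words split into two or more tokens (assumed reading of PCW)."""
    n_words = n_split = 0
    for text in texts:
        for word in split_words(text):
            n_words += 1
            n_split += len(tokenize(word)) > 1
    return n_split / n_words if n_words else 0.0


def parity(src_texts: Iterable[str], tgt_texts: Iterable[str],
           tokenize: Tokenize) -> float:
    """Ratio of total token counts over parallel texts (assumed definition)."""
    src = sum(len(tokenize(t)) for t in src_texts)
    tgt = sum(len(tokenize(t)) for t in tgt_texts)
    return src / tgt if tgt else float("inf")


def toy_tokenize(text: str, size: int = 4) -> List[str]:
    """Toy stand-in for a real tokenizer: fixed-size character chunks per word."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


if __name__ == "__main__":
    sample = ["tokenizers are fun"]  # 3 words, 5 toy tokens, 1 split word
    print("fertility:", fertility(sample, toy_tokenize))
    print("pcw:", pcw(sample, toy_tokenize))
    print("parity:", parity(["abcdefgh"], ["ab"], toy_tokenize))
```

Passing the tokenizer and the word segmenter in as callables keeps the metrics independent of any particular library, so the same functions could be reused across the candidate datasets and fed into the plotting code for the paper.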
