
How to reduce the dictionary size of the pre-trained model? #5

Description

@Surnal

Hello, in Appendix C of the original paper I saw: "Specifically, bert-base-uncased/bert-base-cased/bert-base-german-cased are equipped with vocabularies containing 30k/29k/30k tokens, while the dictionary of bert-base-multilingual-cased contains 119k tokens, which is much larger because it consists of the common tokens among 104 languages. For each low-resource language considered in our experiments, directly loading the whole embedding matrix of the multilingual BERT model will waste a lot of GPU memory. Therefore we only consider tokens that appear in the training and validation set, and manually modify the checkpoint of the multilingual BERT to omit the embeddings of unused tokens. In this way, we obtain dictionaries that contain 24k/16k/17k/16k tokens for Ro/It/Es/Nl respectively, which ultimately save around 77M parameters in average."

How is this done in the code? What does "manually modify the checkpoint" mean in practice?
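For context, here is my guess at what such a pruning step might look like with the Hugging Face transformers library (PyTorch). The file name train_plus_valid.txt and the output directory mbert-pruned are placeholders of mine, not from this repo, and this may well differ from what the paper actually did:

```python
import torch
from transformers import BertModel, BertTokenizer

# A rough sketch of my understanding -- not the authors' code. It assumes a
# plain-text file with the training + validation sentences, one per line,
# and the Hugging Face bert-base-multilingual-cased checkpoint.

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# 1) Collect every token id that actually occurs in the data, plus the
#    special tokens ([CLS], [SEP], [PAD], [UNK], [MASK]).
used_ids = set(tokenizer.all_special_ids)
with open("train_plus_valid.txt", encoding="utf-8") as f:
    for line in f:
        used_ids.update(tokenizer.encode(line.strip(), add_special_tokens=False))

kept = sorted(used_ids)
# Mapping needed later to remap input ids / rewrite the vocab file.
old_to_new = {old: new for new, old in enumerate(kept)}

# 2) Slice the word-embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings().weight.data   # [vocab_size, hidden_size]
new_emb = torch.nn.Embedding(len(kept), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[torch.tensor(kept)])
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(kept)

# 3) Save the pruned checkpoint; the tokenizer's vocab file would have to be
#    rewritten with the same old -> new id mapping so the two stay consistent.
model.save_pretrained("mbert-pruned")
```

If I understand correctly, every input would then also have to be remapped through old_to_new before being fed to the pruned model. Is that roughly what you did, or is there a script in the repo that handles this?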
Looking forward to your reply, thank you very much!
