Hello, I saw "Specifically, bert-base-uncased/bert-base-cased/bert-base-german-cased are equipped with vocabularies containing 30k/29k/30k tokens, while the dictionary of bert-base-multilingual-cased contains 119k tokens, which is much larger because it consists of the common tokens among 104 languages. For each low-resource language considered in our experiments, directly loading the whole embedding matrix of the multilingual BERT model will waste a lot of GPU memory. Therefore we only consider tokens that appear in the training and validation set, and manually modify the checkpoint of the multilingual BERT to omit the embeddings of unused tokens. In this way, we obtain dictionaries that contain 24k/16k/17k/16k tokens for Ro/It/Es/Nl respectively, which ultimately save around 77M parameters in average." in Appendix C of the original paper.
How is this done in the code? What does "manually modifying a checkpoint" mean in practice?
Looking forward to your reply, thank you very much.
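To make the question concrete, here is my understanding of what the paper might mean, as a minimal sketch: collect the set of tokens that appear in the training/validation data, keep only the corresponding rows of the word-embedding matrix (plus special tokens), and remap the vocabulary ids. All names below are illustrative, not taken from your repository; in the real setup I assume the same slicing would be applied to the `bert.embeddings.word_embeddings.weight` tensor of the saved bert-base-multilingual-cased state dict, together with a trimmed `vocab.txt`.

```python
import numpy as np

def prune_embeddings(embedding, vocab, corpus_tokens,
                     specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Return (pruned_embedding, new_vocab), keeping only the rows of
    `embedding` whose tokens appear in `corpus_tokens` (plus specials)."""
    used = set(corpus_tokens) | set(specials)
    # Preserve the original id order so relative positions stay stable.
    kept = [(i, tok) for i, tok in enumerate(vocab) if tok in used]
    old_ids = [i for i, _ in kept]
    new_vocab = {tok: new_id for new_id, (_, tok) in enumerate(kept)}
    return embedding[old_ids], new_vocab

# Toy demo: 7-token vocabulary, 4-dim embeddings, corpus uses 2 tokens.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "casa", "gatto", "huis"]
emb = np.random.rand(len(vocab), 4).astype(np.float32)
corpus = ["casa", "gatto", "casa"]

pruned, new_vocab = prune_embeddings(emb, vocab, corpus)
print(pruned.shape)       # (6, 4): 4 specials present in vocab + 2 corpus tokens
print(new_vocab["casa"])  # 4
```

Is this roughly what the paper does, and if so, where in the code is the checkpoint rewritten and saved?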