Issues with current Specials:
- User can't modify the string value of the special
- Text can't be printed out with join after reverse numericalize due to enum type (common use case)
original_text = ' '.join(vocab.reverse_numericalize(batch_x.text)) # itos contains both `str` and `SpecialVocabSymbols`
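The second issue can be reproduced with a small stand-in for the current enum-based specials. This is a minimal sketch: the members and string values of `SpecialVocabSymbols` below are assumptions made for illustration, and `itos` is a hand-written list rather than the output of a real `Vocab`.

```python
from enum import Enum


class SpecialVocabSymbols(Enum):
    # Hypothetical members and values, for illustration only
    UNK = "<unk>"
    PAD = "<pad>"


# A mixed list of plain strings and enum members, as reverse numericalization would return
itos = ["this", "is", SpecialVocabSymbols.UNK, SpecialVocabSymbols.PAD]

try:
    text = " ".join(itos)
except TypeError:
    # str.join only accepts str items, so the enum members make it fail
    text = None

# Workaround needed today: unwrap every enum member by hand before joining
text_workaround = " ".join(
    token.value if isinstance(token, SpecialVocabSymbols) else token
    for token in itos
)
print(text_workaround)  # prints: this is <unk> <pad>
```

If specials were themselves `str` instances, the plain `' '.join(...)` call would work without the unwrapping step.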
Idea:
Make Specials subclass `str`. Inheriting from the `Special` base class marks a string as a Special. Each Special has an `apply` method that knows how to apply it to a token and/or a sequence. Example:
import abc


class Special(str):
    # Note: without an ABCMeta metaclass the decorator is only a marker
    @abc.abstractmethod
    def apply(self, sequence_or_token):
        # Method is used ONLY in Vocab.numericalize
        if isinstance(sequence_or_token, str):
            # Apply to token
            pass
        elif isinstance(sequence_or_token, (list, tuple)):
            # Apply to sequence
            pass


class EOS(Special):
    def apply(self, sequence_or_token):
        # Core special, handled by Vocab
        if isinstance(sequence_or_token, str):
            raise ValueError("EOS can only be applied to a sequence")
        elif isinstance(sequence_or_token, list):
            # Extend with self (a Special is itself a str, so there is no `.data` attribute)
            return sequence_or_token + [self]


class UNK(Special):
    def apply(self, sequence_or_token):
        # Core special, handled by Vocab
        pass
This allows us to:
eos = EOS('<eos>')
sequence = ['this', 'is', 'a', 'sequence']
print(' '.join(eos.apply(sequence)))
# prints: this is a sequence <eos>
So the user can define the string for the special (issue 1), and joining the text after reverse numericalization works because every special is a `str` (issue 2).
Up for discussion:
- Do we introduce an inheritance hierarchy for specials? A natural hierarchy is:
  - CoreSpecial (PAD, UNK; behavior is hardcoded in Vocab, `apply` isn't used)
  - TokenSpecial (applied at the token level for efficiency. Example: substituting numbers with a placeholder token, or masking tokens, can be a special instead of a hook)
  - SequenceSpecial (anything that works at the sequence level: EOS, BOS, maybe MASK)
- Referencing specials
  - The Vocab needs to find the core specials in order to provide them to Field (e.g. for padding)
  - The `Vocab.padding_index` method has to check `if PAD in self.stoi` / `self.specials` (TBD: maybe make the list of specials an attribute of Vocab)
  - Proposal (maybe bad): make `__hash__` and `__eq__` of specials trigger on the concrete class, not on the string
    - Required: there can be only one of each special in the Vocab (natural)
    - Checking `if PAD in stoi` would essentially check `if stoi[idx] == PAD.__class__` instead of `== str(PAD)` (illustrative)
  - Alternative:
    - Check
      for special in self.specials:
          if type(special) is PAD:
              return self.stoi[special]
    - Requires storing all specials as an attribute (probably nicer)
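The hierarchy and the "Alternative" lookup discussed above could fit together roughly as follows. This is a sketch under stated assumptions, not a proposed implementation: `ToyVocab`, `index_of`, and the `NUM` token special are hypothetical names introduced only for illustration, and the real `Vocab` and `Field` are not modelled.

```python
class Special(str):
    """Marker base class: a str that is an instance of Special is a special token."""

    def apply(self, sequence_or_token):
        # Default behavior: leave the input unchanged
        return sequence_or_token


class CoreSpecial(Special):
    """Behavior is hardcoded in the vocab; apply is not used."""


class TokenSpecial(Special):
    """Applied to one token at a time."""


class SequenceSpecial(Special):
    """Applied to a whole sequence."""


class PAD(CoreSpecial):
    pass


class UNK(CoreSpecial):
    pass


class NUM(TokenSpecial):  # hypothetical token-level special
    def apply(self, token):
        # Replace purely numeric tokens with this special
        return self if token.isdigit() else token


class EOS(SequenceSpecial):
    def apply(self, sequence):
        # Append this special to a copy of the sequence
        return list(sequence) + [self]


class ToyVocab:  # hypothetical stand-in for Vocab
    def __init__(self, tokens, specials=()):
        # Specials are stored as an attribute so they can be found by type
        self.specials = tuple(specials)
        kinds = [type(special) for special in self.specials]
        if len(kinds) != len(set(kinds)):
            raise ValueError("only one special of each type is allowed")
        self.itos = list(self.specials) + [t for t in tokens if t not in self.specials]
        self.stoi = {token: index for index, token in enumerate(self.itos)}

    def index_of(self, special_type):
        # Look the special up by concrete class, not by its string value
        for special in self.specials:
            if type(special) is special_type:
                return self.stoi[special]
        raise KeyError(f"no {special_type.__name__} special in this vocab")

    @property
    def padding_index(self):
        return self.index_of(PAD)


vocab = ToyVocab(["this", "is"], specials=[PAD("<my_pad>"), UNK("<unk>"), EOS("<eos>")])
print(vocab.padding_index)   # prints: 0
print(" ".join(vocab.itos))  # prints: <my_pad> <unk> <eos> this is
```

Because this variant leaves the `str` hashing and equality of specials untouched, plain-string lookups such as `vocab.stoi["<my_pad>"]` keep working, and the "one special of each type" requirement is enforced once, at construction time.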