Proposal: specials revamp #216

@mttk

Issues with current Specials:

  • The user can't modify the string value of a special
  • Text can't be joined and printed after reverse_numericalize, because itos mixes str with enum members (a common use case):
original_text = ' '.join(vocab.reverse_numericalize(batch_x.text)) # itos contains both `str` and `SpecialVocabSymbols`
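The join failure can be reproduced outside the library. A minimal sketch, assuming the current specials are plain `enum.Enum` members; the `SpecialVocabSymbols` definition and `StrSpecial` below are stand-ins for illustration, not the library's actual classes:

```python
import enum


class SpecialVocabSymbols(enum.Enum):
    # Stand-in for the current enum-based specials (values are assumed).
    UNK = "<unk>"
    PAD = "<pad>"


class StrSpecial(str):
    # Hypothetical str-based special, as proposed in this issue.
    pass


# A token list mixing plain strings with an enum member.
enum_tokens = ["a", "sentence", SpecialVocabSymbols.PAD]
try:
    " ".join(enum_tokens)
    join_failed = False
except TypeError:
    # str.join only accepts str instances; an Enum member is not one.
    join_failed = True

# With a str subclass the same join works without any conversion.
str_tokens = ["a", "sentence", StrSpecial("<pad>")]
joined = " ".join(str_tokens)
```

Here `join_failed` ends up `True` for the enum-based list, while the str-based list joins like any other list of tokens.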

Idea:
Make Specials subclass str. Inheriting from the Special base class marks a string as a special. Each Special has an apply method that knows how to apply it to a token and/or a sequence. Example:

import abc


class Special(str):
    # Subclassing str lets a special be joined and printed like any token.
    @abc.abstractmethod  # documents intent only; not enforced without ABCMeta
    def apply(self, sequence_or_token):
        # Method is used ONLY in Vocab.numericalize
        if isinstance(sequence_or_token, str):
            # Apply to token
            pass
        elif isinstance(sequence_or_token, (list, tuple)):
            # Apply to sequence
            pass


class EOS(Special):
    def apply(self, sequence_or_token):
        # Sequence-level special: appended to the end of a sequence
        if isinstance(sequence_or_token, str):
            raise ValueError("EOS can only be applied to a sequence")
        elif isinstance(sequence_or_token, list):
            # Extend with self (a str subclass has no .data attribute)
            return sequence_or_token + [self]


class UNK(Special):
    def apply(self, sequence_or_token):
        # Core special, handled by Vocab
        pass

This allows us to:

eos = EOS('<eos>')
sequence = ['this', 'is', 'a', 'sequence']
print(' '.join(eos.apply(sequence)))
# prints: this is a sequence <eos>

So the user can define (1) the string value of the special and (2) how the special is applied.
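Both customization points can be exercised with user-defined specials. A minimal sketch under the proposal above; `BOS` and `MaskNumbers` and their string values are hypothetical examples, and the `Special` base here is a reduced stand-in:

```python
class Special(str):
    # Reduced stand-in for the proposed base class.
    def apply(self, sequence_or_token):
        raise NotImplementedError


class BOS(Special):
    # Hypothetical sequence-level special: prepends itself to a sequence.
    def apply(self, sequence_or_token):
        if isinstance(sequence_or_token, str):
            raise ValueError("BOS can only be applied to a sequence")
        return [self] + list(sequence_or_token)


class MaskNumbers(Special):
    # Hypothetical token-level special: replaces numeric tokens with itself.
    def apply(self, sequence_or_token):
        if isinstance(sequence_or_token, str):
            return self if sequence_or_token.isdigit() else sequence_or_token
        return [self.apply(token) for token in sequence_or_token]


bos = BOS("[START]")        # the user picks the string value
num = MaskNumbers("<num>")  # and the class decides how it is applied

sequence = ["call", "me", "at", "555"]
result = bos.apply(num.apply(sequence))
text = " ".join(result)     # joins directly because every item is a str
```

The resulting list still holds `BOS` and `MaskNumbers` instances, so the vocab could recognise them by class while the text stays printable.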
Up for discussion:

  1. Do we introduce an inheritance hierarchy for specials? A natural hierarchy is:
  • CoreSpecial (PAD, UNK: behavior is hardcoded in the vocab, apply isn't used)
  • TokenSpecial (applied at token level for efficiency. Example: substituting numbers with a placeholder token, or masking tokens, can be a special instead of a hook)
  • SequenceSpecial (anything that works at sequence level: EOS, BOS, maybe MASK)
  2. Referencing specials
  • The Vocab needs to find the core specials in order to provide them to Field (e.g. for padding)
  • The Vocab.padding_index method has to check whether PAD is in self.stoi/self.specials (TBD: maybe make the list of specials an attribute of the vocab)
    • Proposal (maybe bad): make __hash__ and __eq__ of specials depend on the concrete class, not on the string
      • Required: there can be only one of each special in the Vocab (natural)
      • Checking for PAD in stoi would essentially check if stoi[idx] == PAD.__class__ instead of == str(PAD) (illustrative)
  • Alternative:
    - Check the type of each stored special:
for special in self.specials:
    if type(special) is PAD:
        return self.stoi[special]
    - Requires storing all specials as an attribute (probably nicer)
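The alternative lookup could be sketched as follows. This is an illustration only: the `Vocab` below is a hypothetical minimal class, not the library's implementation, and `index_of_special` is a helper name made up for this example.

```python
class Special(str):
    pass


class CoreSpecial(Special):
    # Behaviour is hardcoded in the vocab; apply is never called.
    pass


class PAD(CoreSpecial):
    pass


class UNK(CoreSpecial):
    pass


class Vocab:
    # Hypothetical minimal vocab: specials come first and are kept as an attribute.
    def __init__(self, specials, tokens):
        self.specials = tuple(specials)
        self.itos = list(self.specials) + list(tokens)
        self.stoi = {token: index for index, token in enumerate(self.itos)}

    def index_of_special(self, special_cls):
        # Scan by concrete class, so the user-chosen string value is irrelevant.
        for special in self.specials:
            if type(special) is special_cls:
                return self.stoi[special]
        raise ValueError(f"{special_cls.__name__} is not in this vocab")

    def padding_index(self):
        return self.index_of_special(PAD)


vocab = Vocab([UNK("<unk>"), PAD("[PAD]")], ["this", "is"])
pad_index = vocab.padding_index()
```

Because `__hash__` and `__eq__` are left as inherited from `str`, looking up the plain string `"[PAD]"` in `stoi` still resolves to the same index. Overriding them to compare classes would likely change how plain strings resolve, which is one thing to check before adopting that proposal.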

Labels: feature (New feature or request)