
List comprehension in Our_Tokenizer class and episode flow #75

@srappel

Description

When the class is first created we use:

import en_core_web_sm

class Our_Tokenizer:
  def __init__(self):
    # load the spaCy language model (provides the tokenizer)
    self.nlp = en_core_web_sm.load()
    self.nlp.max_length = 4500000 # increase max number of characters that spacy can process (default = 1,000,000)
  def __call__(self, document):
    tokens = self.nlp(document)
    simplified_tokens = [str.lower(token.lemma_) for token in tokens]
    return simplified_tokens

This issue relates to this line:

simplified_tokens = [str.lower(token.lemma_) for token in tokens]

Using a list comprehension like this keeps the code short, but then we have to explain list comprehensions to learners. Not the worst thing.
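For context, the explanation would only need to show that a list comprehension is shorthand for a build-up loop. A minimal, spaCy-free sketch (the word list here is hypothetical):

```python
words = ["The", "Quick", "Fox"]

# for-loop version: build the list step by step
lowered_loop = []
for w in words:
    lowered_loop.append(w.lower())

# list-comprehension version: same result in one expression
lowered_comp = [w.lower() for w in words]

print(lowered_loop == lowered_comp)  # True
```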

However, when we incorporate stop words into the class, we use a for-loop:

    simplified_tokens = []    
    for token in tokens:
        if not token.is_stop and not token.is_punct:
            simplified_tokens.append(str.lower(token.lemma_))
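To show the two styles are interchangeable here, this sketch rewrites the stop-word loop as a comprehension. It uses a hypothetical stand-in dataclass so it runs without loading a spaCy model; the attribute names (`lemma_`, `is_stop`, `is_punct`) mirror spaCy's Token API, but the class itself is not spaCy:

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    # stand-in for a spaCy Token; attribute names match spaCy's
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False

tokens = [
    FakeToken("The", is_stop=True),
    FakeToken("Fox"),
    FakeToken(".", is_punct=True),
]

# for-loop version, as in the episode
loop_result = []
for token in tokens:
    if not token.is_stop and not token.is_punct:
        loop_result.append(str.lower(token.lemma_))

# equivalent list comprehension
comp_result = [
    str.lower(token.lemma_)
    for token in tokens
    if not token.is_stop and not token.is_punct
]

print(loop_result == comp_result)  # True
```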

Then we switch back to a more complex list comprehension later:

    simplified_tokens = [
      str.lower(token.lemma_) for token in tokens
      if not token.is_stop
      and not token.is_punct
      and token.pos_ in {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}
    ]
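Conversely, if the episode standardizes on for-loops, the part-of-speech filter above could be written in loop form. Again using a hypothetical stand-in dataclass (attribute names mirror spaCy's Token API; the example tokens are invented):

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    # stand-in for a spaCy Token; attribute names match spaCy's
    lemma_: str
    pos_: str
    is_stop: bool = False
    is_punct: bool = False

tokens = [
    FakeToken("the", "DET", is_stop=True),
    FakeToken("quick", "ADJ"),
    FakeToken("fox", "NOUN"),
    FakeToken(".", "PUNCT", is_punct=True),
]

keep_pos = {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}
simplified_tokens = []
for token in tokens:
    if not token.is_stop and not token.is_punct and token.pos_ in keep_pos:
        simplified_tokens.append(str.lower(token.lemma_))

print(simplified_tokens)  # ['quick', 'fox']
```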

We should either stick with list comprehensions (and include a brief note explaining what they are) or use the for-loop approach consistently throughout this episode.
