Description
When the class is first created we use:
class Our_Tokenizer:
    def __init__(self):
        # import spacy tokenizer/language model
        self.nlp = en_core_web_sm.load()
        self.nlp.max_length = 4500000  # increase max number of characters that spacy can process (default = 1,000,000)

    def __call__(self, document):
        tokens = self.nlp(document)
        simplified_tokens = [str.lower(token.lemma_) for token in tokens]
        return simplified_tokens

This issue relates to this line:

    simplified_tokens = [str.lower(token.lemma_) for token in tokens]

Using a list comprehension like this makes the code shorter, but then we have to explain list comprehensions to learners. Not the worst thing.
However, when we incorporate stop words into the class, we use a for-loop:
simplified_tokens = []
for token in tokens:
    if not token.is_stop and not token.is_punct:
        simplified_tokens.append(str.lower(token.lemma_))

Then we switch back to a more complex list comprehension later:
simplified_tokens = [
    token for token in tokens
    if not token.is_stop
    and not token.is_punct
    and token.pos_ in {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}
]

We should either stick with list comprehensions (and include a brief note about what they are) or use a for-loop throughout this episode.
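
If we go the "brief note" route, a side-by-side comparison might be enough for learners. The sketch below is only an illustration, not the lesson's code: `FakeToken` is a hypothetical stand-in for a spaCy token (so the example runs without a language model), and the two function names are made up for this comparison. It shows the stop-word filter written both ways and checks that they give the same result.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token; only the attributes used here."""
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False


# A tiny hand-made "document" so no spaCy model is needed.
tokens = [
    FakeToken("the", is_stop=True),
    FakeToken("Cat"),
    FakeToken("sit"),
    FakeToken(".", is_punct=True),
]


def simplify_with_loop(tokens):
    # For-loop version: build the list one item at a time.
    simplified_tokens = []
    for token in tokens:
        if not token.is_stop and not token.is_punct:
            simplified_tokens.append(token.lemma_.lower())
    return simplified_tokens


def simplify_with_comprehension(tokens):
    # List comprehension version: same loop and same condition, in one expression.
    return [
        token.lemma_.lower()
        for token in tokens
        if not token.is_stop and not token.is_punct
    ]


loop_result = simplify_with_loop(tokens)
comprehension_result = simplify_with_comprehension(tokens)
print(loop_result)           # ['cat', 'sit']
print(comprehension_result)  # ['cat', 'sit']
```

Reading the comprehension as "the expression, then the `for`, then the `if`" maps directly onto the three lines of the loop body, which may be all the explanation the episode needs.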