# -*- coding: utf-8 -*-
"""Model_training.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1AwXEm3zgTKyR2fZOlHd5mx8seDy2FX7l
# Sentiment Analysis with Hugging Face
Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access interesting pre-built models, use them directly, or fine-tune them (retrain them on your dataset, leveraging the prior knowledge from the initial training), then host your trained models on the platform so you can use them later on other devices and apps.
Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.
[Read more about text classification with Hugging Face](https://huggingface.co/tasks/text-classification)
Hugging Face models are deep-learning based, so they need a lot of GPU compute to train. Please use [Colab](https://colab.research.google.com/), another GPU cloud provider, or a local machine with an NVIDIA GPU.
## Application of Hugging Face Text Classification Model Fine-Tuning
Find below a simple example, with just `3 epochs of fine-tuning`.
Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.)
# Variable definition:
tweet_id: Unique identifier of the tweet
safe_text: Text contained in the tweet. Some sensitive information, such as usernames and URLs, has been removed
label: Sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive)
agreement: The tweets were labeled by three people. Agreement indicates the percentage of the three reviewers that agreed on the given label. You may use this column in your training, but agreement data will not be shared for the test set.
"""
from google.colab import drive
drive.mount('/content/drive')
# Install dependencies (nfx is the conventional alias for neattext.functions,
# so neattext is the package that provides it)
!pip install huggingface_hub
!pip install neattext
!pip install -qU transformers[torch] datasets accelerate
!pip install -r /content/drive/MyDrive/Covid_Vaccine_Sentiment_Analysis-main/requirements.txt
!pip install wordcloud
!pip install nltk
import os
import pandas as pd
# Import the Hugging Face notebook login
from huggingface_hub import notebook_login
#Dataset
from datasets import load_dataset
from sklearn.model_selection import train_test_split
#Visualization Libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
# To extract hashtags
import neattext.functions as nfx
import re
import warnings
warnings.filterwarnings("ignore")
# Stopwords are commonly used words ("a", "the", "an", "is", "are") that are
# removed because they carry little standalone meaning.
# Import specific functions and classes from NLTK (Natural Language Toolkit)
from nltk.tokenize import word_tokenize  # used for tokenizing text into individual words
from nltk.corpus import stopwords  # provides a list of common words that are often removed from text
from nltk.stem import PorterStemmer  # a stemming algorithm that reduces words to their base or root form
# Download the English stopword list from the NLTK corpus
import nltk
nltk.download("stopwords")
# Create an instance of the PorterStemmer class; it will be used later to
# reduce words to their base or root form
stemmer = PorterStemmer()
notebook_login()
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"
# Load the dataset and display some values
Link = 'https://raw.githubusercontent.com/Newton23-nk/Covid_Vaccine_Sentiment_Analysis/main/Datasets/Train.csv'
df = pd.read_csv(Link)
# Drop rows containing any NaN values (equivalent to df.dropna())
df = df[~df.isna().any(axis=1)]
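As a sanity check, the boolean-mask filter above behaves like pandas' `dropna`; a minimal sketch on a toy frame (illustrative data, not the real dataset):

```python
import pandas as pd
import numpy as np

# Tiny frame showing that ~df.isna().any(axis=1) keeps exactly the
# fully populated rows, the same result dropna() gives.
toy = pd.DataFrame({"safe_text": ["ok", None, "fine"],
                    "label": [1.0, 0.0, np.nan]})
kept = toy[~toy.isna().any(axis=1)]
# Only the first row survives; toy.dropna() gives the same frame.
```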
"""# Exploratory Data Analysis"""
df.head()
# We look at the number of positive, negative and neutral reviews
df.label.value_counts()
# Count of the agreement values
df.agreement.value_counts()
# Length of the reviews
review_length = df.safe_text.str.len()
# Length of the longest review
max(review_length)
# Length of the shortest review
min(review_length)
# Length of Tweets
text_length = df['safe_text'].apply(len)
sns.histplot(text_length, kde=True)
plt.title('Distribution of Text Lengths')
plt.xlabel('Text Length')
plt.ylabel('Count')
plt.show()
"""The highest text length observed is 153 characters, while the minimum text length is 3 characters."""
# Distribution of Sentiments
sns.countplot(x='label', data=df)
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment Label')
plt.ylabel('Count')
plt.show()
"""The distribution of sentiments in the dataset, as depicted by the count plot, shows the prevalence of different sentiment labels within the Twitter posts related to COVID-19 vaccinations.
* Sentiment Label 0 (Neutral):
The sentiment label "0" (neutral) has the highest count, with approximately 5000 instances. This suggests that a significant portion of the collected tweets exhibit a neutral sentiment when it comes to discussing COVID-19 vaccinations. Neutral sentiments often indicate that the tweets may not strongly express positive or negative opinions but rather present factual information or observations.
* Sentiment Label 1 (Positive):
The sentiment label "1" (positive) follows with around 4000 instances. This indicates that a substantial number of tweets show a positive sentiment towards COVID-19 vaccinations. These tweets might express support for vaccinations, share positive experiences, or provide information about vaccination availability and benefits.
* Sentiment Label -1 (Negative):
The sentiment label "-1" (negative) has the lowest count, with approximately 1000 instances. This suggests that a relatively smaller portion of the collected tweets exhibit a negative sentiment towards COVID-19 vaccinations. Negative sentiments can encompass concerns, skepticism, or criticism about the vaccines, their safety, or potential side effects.
"""
# Distribution of Agreement Percentages
plt.hist(df['agreement'], color='blue')
plt.title('Distribution of Agreement Percentages')
plt.xlabel('Agreement Percentage')
plt.ylabel('Frequency')
plt.show()
"""* From the distribution plot, it is clear that the majority of tweets have an agreement percentage of 1.000000 (100% agreement among reviewers).
This means that for a significant portion of tweets, all three reviewers assigned the same sentiment label without disagreement.
* For a substantial number of tweets, two out of three reviewers agreed on the assigned sentiment label with a count of 3894.
* Finally, a smaller number of tweets have an agreement percentage of 0.333333, indicating that only one out of three reviewers agreed on the label.
"""
# Concatenate all text from the 'safe_text' column into a single string
text = ' '.join(df['safe_text'])
# Generate the word cloud with a white background
cloud_two_cities = WordCloud(width=800, height=400, background_color='white').generate(text)
# Display the word cloud
plt.figure(figsize=(8, 5))
plt.imshow(cloud_two_cities, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=1)
plt.show()
"""* The high frequency of "vaccine" and "vaccinate" aligns with the overarching theme of COVID-19 vaccinations. Neutral sentiment tweets may contain factual information, discussions, or updates related to the vaccines, contributing to a neutral tone.
* The term "measles" appearing prominently suggests that discussions within the neutral sentiment category often include references to the measles virus. It's possible that some tweets are drawing comparisons or discussing related topics in the context of COVID-19 vaccinations.
* The appearance of "kid" and "children" indicates that discussions involving younger individuals, possibly in the context of vaccination decisions for children, are present within the neutral sentiment tweets.
# Data Cleaning
"""
# Checking for missing values
df.isna().sum()
df.duplicated().sum()
"""We will extract hashtags, which can also be used for analysis, e.g. to find which hashtags were most common aside from #Covid and #Vaccine"""
# get hashtags
df['extract_hashtags'] = df['safe_text'].apply(nfx.extract_hashtags)
df[['extract_hashtags','safe_text']]
# remove hashtags from the column and save the cleaned text to clean text column
df['clean_text'] = df['safe_text'].apply(nfx.remove_hashtags)
# preview
df[['safe_text','clean_text']].head(10)
"""We will remove the user handles and the "RT" retweet indicator using the nfx.remove_userhandles in the nfx library."""
# remove the "RT" retweet indicator and user handles
def removeRT(text):
    # \bRT\b avoids clipping "RT" out of the middle of words such as "ALERT"
    return re.sub(r"\bRT\b", "", text)
df['clean_text'] = df['clean_text'].apply(nfx.remove_userhandles)
df['clean_text'] = df['clean_text'].apply(removeRT)
#Preview of the safe text and clean text
df[['safe_text','clean_text']].head(10)
"""We then remove multiple spaces and strip any leading or trailing spaces using the nfx.remove_multiple_spaces function and a custom function stripSpace."""
# strip leading and trailing spaces
def stripSpace(text):
    return text.strip()
df['clean_text'] = df['clean_text'].apply(nfx.remove_multiple_spaces)
df['clean_text'] = df['clean_text'].apply(stripSpace)
"""To further reduce noise in the data and to remove irrelevant content, we will remove URLs from the data"""
# remove all urls
df['clean_text'] = df['clean_text'].apply(nfx.remove_urls)
df[['safe_text','clean_text']].head(10)
"""We will remove punctuation to standardize the data and to ensure consistency in the data"""
# remove punctuation
df['clean_text'] = df['clean_text'].apply(nfx.remove_puncts)
df[['safe_text','clean_text']].head(10)
# Check for null text; some rows may have had all their content cleaned away
df.isna().sum()
"""We then remove punctuation from each hashtag and also remove the '#' symbol. This is to standardize hashtag representations."""
# Turn the hashtag lists into clean strings, removing punctuation and the '#' symbol
def clean_hash_tag(text):
    return " ".join([nfx.remove_puncts(x).replace("#", "") for x in text])
df['extract_hashtags'] = df['extract_hashtags'].apply(clean_hash_tag)
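The same hashtag extraction and cleaning can be sketched with the standard-library `re` module (a rough stand-in for the `nfx` helpers, shown only to illustrate the logic):

```python
import re

def extract_hashtags(text):
    # Grab "#word" tokens; a rough stand-in for nfx.extract_hashtags.
    return re.findall(r"#\w+", text)

def clean_hashtags(tags):
    # Drop the leading '#' and join into one string, mirroring clean_hash_tag above.
    return " ".join(tag.lstrip("#") for tag in tags)

tags = extract_hashtags("Get your shot! #Covid #Vaccine2021")
cleaned = clean_hashtags(tags)  # 'Covid Vaccine2021'
```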
"""## Dealing with Emojis"""
# dealing with emojis
df['clean_text'].apply(nfx.extract_emojis)
# removing the emojis
df['clean_text'] = df['clean_text'].apply(nfx.remove_emojis)
# Replace '<user>' with an empty string in the 'clean_text' column
df['clean_text'] = df['clean_text'].str.replace('<user>', '')
df['clean_text'] = df['clean_text'].str.replace('@', '')
df['clean_text'] = df['clean_text'].str.replace('<url>', '')
df['clean_text'] = df['clean_text'].str.replace('measles', 'Measles')
df['clean_text'] = df['clean_text'].str.replace('“', '')
"""**Normalize ['vaccine', 'vaccines', 'vaccinate', 'vaccinated', 'vaccinations', 'vaccination'] to 'vaccine'**"""
#We define the words to replace
words_to_replace = ['vaccine', 'vaccines', 'vaccinate', 'vaccinated', 'vaccinations', 'vaccination']
# Pattern to match any of the words in the list, using a regular expression
pattern = r'\b(?:{})\b'.format('|'.join(words_to_replace))
# Function to replace the listed words with 'vaccine'
def replace_with_vaccine(series):
    # regex=True is required so the pattern and case=False are honored
    return series.str.replace(pattern, 'vaccine', case=False, regex=True)
# Apply the function to the 'clean_text' column
df['clean_text'] = replace_with_vaccine(df['clean_text'])
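The word-boundary pattern built above can be exercised directly with the standard-library `re` module (toy sentence; `re.IGNORECASE` plays the role of `case=False`):

```python
import re

words_to_replace = ['vaccine', 'vaccines', 'vaccinate', 'vaccinated',
                    'vaccinations', 'vaccination']
# \b(?:...)\b matches any listed word as a whole token, never a substring
pattern = r'\b(?:{})\b'.format('|'.join(words_to_replace))

sample = "They Vaccinated early; vaccination rates rose."
normalized = re.sub(pattern, 'vaccine', sample, flags=re.IGNORECASE)
# → "They vaccine early; vaccine rates rose."
```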
"""**Normalize ['kids', 'child', 'children'] to 'child'**"""
words_to_replace_2 = ['kids', 'child', 'children']
# Pattern to match any of the words in the list, using a regular expression
pattern_2 = r'\b(?:{})\b'.format('|'.join(words_to_replace_2 ))
# Function to replace the listed words with 'child'
def replace_with_child(series):
    return series.str.replace(pattern_2, 'child', case=False, regex=True)
# Apply the function to the 'clean_text' column
df['clean_text'] = replace_with_child(df['clean_text'])
words_ = nltk.FreqDist(df['clean_text'].str.split().sum())
words = words_.most_common(30)
words
"""## Removing Stop words"""
",".join(stopwords.words('english'))
stop_words = set(stopwords.words('english'))
# Convert the clean_text column to lowercase so stopword matching works
df['clean_text'] = df['clean_text'].str.lower()
# remove stop words, re-joining the remaining tokens with spaces
def remove_stop(x):
    return " ".join([word for word in str(x).split() if word not in stop_words])
df['clean_text'] = df['clean_text'].apply(remove_stop)
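The stopword-filtering step can be sketched with a small hard-coded stopword set (the real NLTK English list is much longer; this only illustrates the token filtering):

```python
# Assumption: a tiny stand-in stopword set, not NLTK's full English list.
stop_words = {"a", "an", "the", "is", "are", "of", "to"}

def remove_stop(text):
    # Keep only tokens that are not stopwords, re-joined with single spaces.
    return " ".join(w for w in str(text).split() if w not in stop_words)

result = remove_stop("the vaccine is safe")  # 'vaccine safe'
```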
# Remove punctuation characters (applied to 'clean_text' so the earlier
# cleaning steps are preserved, rather than restarting from 'safe_text')
df['clean_text'] = df['clean_text'].str.replace(r"[&;. ,#@\"!']", " ", regex=True)
"""## Stemming"""
ps = PorterStemmer()
# Stem each token in every tweet and write the result back to 'clean_text'
df['clean_text'] = df['clean_text'].apply(
    lambda text: " ".join(ps.stem(word) for word in str(text).split())
)
df.head()
# Replace stray tokens ['-', '"', 'u'] with an empty string
words_to_replace_3 = ['-', '"', 'u']
# Pattern to match any of the tokens in the list, using a regular expression
pattern_3 = r'\b(?:{})\b'.format('|'.join(words_to_replace_3))
# Function to remove the listed tokens
def remove_stray_tokens(series):
    return series.str.replace(pattern_3, '', case=False, regex=True)
# Apply the function to the 'clean_text' column
df['clean_text'] = remove_stray_tokens(df['clean_text'])
df = df[['tweet_id','clean_text','label','agreement']]
df.head()
df.info()
"""## Data Splitting
We will split the training set into a training subset (the data the model learns from) and an evaluation subset (the data the model uses to compute metric scores, which helps us catch training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).
There are multiple ways to split the dataset; two commented lines further below show an alternative.
"""
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
train.head()
eval.head()
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")
"""We will create a directory and save the 'train_subset.csv' and 'eval_subset.csv' files within that directory."""
# Create a directory to hold the datasets
if not os.path.exists('LP5 Dataset'):
    os.makedirs('LP5 Dataset')
# Check that the directory has been created
if os.path.exists('LP5 Dataset'):
    print('The dataset directory exists')
else:
    print("The dataset directory does not exist")
# Save the split subsets in the dataset folder we created
train.to_csv("./LP5 Dataset/train_subset.csv", index=False)
eval.to_csv("./LP5 Dataset/eval_subset.csv", index=False)
dataset = load_dataset('csv',
data_files={'train': './LP5 Dataset/train_subset.csv',
'eval': './LP5 Dataset/eval_subset.csv'}, encoding = "ISO-8859-1")
# Check that the dataset has been loaded properly
dataset
"""# BERT-BASED MODEL
BERT (Bidirectional Encoder Representations from Transformers) is a popular pre-trained model architecture for natural language understanding and processing. The "base" version of BERT, often referred to as "BERT base," represents a standard configuration of the BERT model. The "cased" variant of BERT retains the original casing of words in the pre-trained embeddings, which means it differentiates between uppercase and lowercase characters.
Here's what "BERT base cased" typically refers to:
1. **Base Model**: The "base" version of BERT usually refers to a model with a certain number of layers, hidden units, and attention heads. The specific architecture can vary, but it's a medium-sized version of BERT that balances computational cost and performance. For instance, it might have 12 layers, 768 hidden units, and 12 attention heads.
2. **Cased**: The "cased" variant of BERT retains the case information of words. This means that it treats "Word" and "word" as distinct tokens, capturing the difference in case. This can be important for tasks where the case of words carries semantic meaning.
BERT base cased is trained on a large corpus of text and can be fine-tuned for various natural language processing tasks, such as text classification, named entity recognition, question-answering, and more. Researchers and practitioners often use BERT base cased as a starting point for their NLP tasks because it provides strong out-of-the-box performance.
You can access pre-trained BERT base cased models and their corresponding tokenizers from the Hugging Face Transformers library, which is a popular library for working with pre-trained models in NLP. These models are available in various languages and can be loaded and fine-tuned for specific NLP tasks.
## Preprocessing
Tokenization is the process of splitting raw text into smaller units called tokens. Tokens are typically words or subwords. Tokenization is a fundamental step in NLP because it breaks down text data into manageable units that can be processed by NLP models. In this case we are using a pretrained tokenizer from Hugging Face's Transformers library.
We will import the AutoTokenizer class from the Hugging Face Transformers library and then create an instance of the AutoTokenizer class and initializes it with a pre-trained BERT tokenizer.
"""
# Import AutoTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
"""We define a function, tokenize_function, that takes a dictionary-like input containing a 'clean_text' key with the text data to tokenize. The function calls the tokenizer object that was initialized earlier and tokenizes the text data provided.
Padding is the process of adding special tokens (often <PAD>) to sequences to make them uniform in length. Setting padding='max_length' pads every tokenized sequence to the model's maximum length, so all sequences end up the same length (padding=True, by contrast, pads only to the longest sequence in each batch).
"""
# A function to tokenize the data
def tokenize_function(df):
    return tokenizer(df['clean_text'], padding='max_length')
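Conceptually, padding appends a pad token id until every sequence reaches a common length, alongside an attention mask that marks real tokens. A toy sketch with made-up token ids (not the real BERT vocabulary):

```python
# Assumption: toy token-id sequences, not output of a real tokenizer.
sequences = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
pad_id = 0
max_len = max(len(s) for s in sequences)

# Pad every sequence to max_len with pad_id
padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
# Attention mask: 1 for real tokens, 0 for padding
attention_masks = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
```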
"""We apply the tokenize_function to the dataset using the map function in a batched manner. This tokenizes each text field and pads the sequences.
The argument batched=True allows multiple examples to be tokenized and padded per call. Batch processing speeds up NLP tasks by handling many examples at once, which is especially beneficial with large datasets.
"""
# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_function, batched=True)
"""We define a function that transforms the sentiment labels -1, 0, and 1 into a zero-based numerical format: -1 becomes 0, 0 becomes 1, and 1 becomes 2."""
def transform_labels(label):
    label = label['label']
    num = 0
    if label == -1:   # 'Negative'
        num = 0
    elif label == 0:  # 'Neutral'
        num = 1
    elif label == 1:  # 'Positive'
        num = 2
    return {'labels': num}
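The same mapping can be written as a lookup table; this dict-based sketch (hypothetical names `label2id` and `transform_labels_dict`) is equivalent to the if/elif branches above:

```python
# Map raw sentiment labels to contiguous class ids
label2id = {-1: 0, 0: 1, 1: 2}

def transform_labels_dict(example):
    # Same result as transform_labels, via a dict lookup
    return {"labels": label2id[example["label"]]}

transform_labels_dict({"label": -1})  # {'labels': 0}
```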
# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'agreement','clean_text']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
dataset
# Specifying the training arguments
from transformers import TrainingArguments
# Configure the training parameters like `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments("Covid_Vaccine_Sentiment_Analysis_Bert_based_Model",
num_train_epochs=1,
load_best_model_at_end=True,
push_to_hub=True,
evaluation_strategy="steps",
save_strategy="steps")
"""AutoModelForSequenceClassification provides a unified interface for loading various pre-trained models (like BERT, RoBERTa, etc.) and fine-tuning them for sequence classification tasks. We will load the pre-trained BERT model and configure it for sequence classification with the specified number of labels."""
# We import the AutoModelForSequenceClassification class from the Transformers library
from transformers import AutoModelForSequenceClassification
# Load a pre-trained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
# We shuffle the dataset to randomize the data and avoid any ordering bias
train_dataset = dataset['train'].shuffle(seed=10)
eval_dataset = dataset['eval'].shuffle(seed=10)
# Another way to split the train set: select index ranges directly, using
# int(num_rows*.8) for [0 - 80%] and int(num_rows*.8), num_rows for the 20% ([80 - 100%])
# train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
# eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))
"""We initialize the Trainer object from the Transformers library. The Trainer class is a high-level API that simplifies the training and evaluation of transformer-based models for various NLP tasks."""
# We import the Trainer class from the Transformers library.
from transformers import Trainer
# We initialize the trainer object and specify the arguments
trainer = Trainer(
model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)
# Launch the learning process: training
trainer.train()
#To push the trained model into hugging face model hub
# trainer.push_to_hub()
# tokenizer.push_to_hub("NewtonKimathi/Covid_Vaccine_Sentiment_Analysis_Bert_based_Model")
"""Don't worry about the interruption above; it is a `KeyboardInterrupt`, meaning I stopped the training early to keep the run time short."""
import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
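`metric.compute` here returns plain classification accuracy, which can also be computed directly from the logits with NumPy (toy logits, for illustration):

```python
import numpy as np

# Toy logits for 4 examples over 3 classes (made-up values)
logits = np.array([[2.0, 0.1, 0.3],   # predicted class 0
                   [0.2, 1.5, 0.1],   # predicted class 1
                   [0.1, 0.2, 3.0],   # predicted class 2
                   [1.0, 0.9, 0.8]])  # predicted class 0
labels = np.array([0, 1, 2, 2])

# argmax over the class axis gives the predicted class per example
predictions = np.argmax(logits, axis=-1)
# Fraction of predictions matching the reference labels
accuracy = float((predictions == labels).mean())  # 0.75
```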
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Launch the final evaluation
trainer.evaluate()
"""# Model 2 : Roberta
"""
from transformers import RobertaForSequenceClassification, RobertaTokenizer
# Load pre-trained RoBERTa model and tokenizer
tokenizer_2 = RobertaTokenizer.from_pretrained("roberta-base")
model2 = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
# Define the tokenize function using tokenizer_2
def tokenize_function_2(df):
    return tokenizer_2(df['clean_text'], padding="max_length")
# using the load_dataset function to load CSV files as datasets
dataset_2 = load_dataset('csv',
data_files={'train': './LP5 Dataset/train_subset.csv',
'eval': './LP5 Dataset/eval_subset.csv'}, encoding = "ISO-8859-1")
dataset_2
# Tokenize the dataset
# Changing the tweets into tokens our model can exploit
dataset_2 = dataset_2.map(tokenize_function_2, batched=True)
dataset_2
def transform_labels(data):
    label = data['label']  # extract the value of 'label' from the input
    num = 0
    if label == -1:   # 'Negative' sentiment
        num = 0
    elif label == 0:  # 'Neutral' sentiment
        num = 1
    elif label == 1:  # 'Positive' sentiment
        num = 2
    return {"labels": num}
# Assuming you are using the 'transform_labels' function for the mapping
drop = ['tweet_id', 'clean_text', 'label', 'agreement']
dataset_2 = dataset_2.map(transform_labels, remove_columns=drop)
dataset_2
# Shuffle the dataset
roberta_train_dataset = dataset_2["train"].shuffle(seed=50)#.take(subset_size)
roberta_eval_dataset = dataset_2["eval"].shuffle(seed=50)
roberta_eval_dataset
# Specifying the training arguments
from transformers import TrainingArguments
# Configure the training parameters like `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args_2 = TrainingArguments("Covid_Vaccine_Sentiment_Analysis_Roberta_Model",
num_train_epochs=1,
load_best_model_at_end=True,
push_to_hub=True,
evaluation_strategy="steps",
save_strategy="steps")
# We import the Trainer class from the Transformers library.
from transformers import Trainer
# Create a trainer
trainer_2 = Trainer(model = model2,args = training_args_2,train_dataset = roberta_train_dataset,
eval_dataset = roberta_eval_dataset)
trainer_2.train()
import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")
def compute_Metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer_2 = Trainer(
model=model2,
args=training_args_2,
train_dataset=roberta_train_dataset,
eval_dataset=roberta_eval_dataset,
compute_metrics=compute_Metrics,
)
# trainer_2.evaluate()