model is important for my research because it uses data preprocessing without sacrificing quality. I believe this model shows how far my preprocessing technique can go.
4) Hyper-parameters
For the model hyper-parameters I used the following values (an illustrative configuration sketch follows the list):
• Batch-size = 1
• Learning-rate = 5e-5
• Epochs = 7
• Warm-up ratio = 0.1
• Gradient-accumulation = 8
• Weight-decay = 0.01
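To make these values concrete, the sketch below shows one way they could be expressed with the Hugging Face Trainer API. This is my own illustration under the assumption that a standard Trainer loop was used; the output directory and the surrounding training code are not taken from the paper.

```python
# Illustrative sketch only: the listed hyper-parameters expressed as a
# Hugging Face TrainingArguments configuration. The output_dir value is
# a hypothetical placeholder, not a detail from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="discourse-classifier",   # hypothetical output path
    per_device_train_batch_size=1,       # Batch-size = 1
    learning_rate=5e-5,                  # Learning-rate = 5e-5
    num_train_epochs=7,                  # Epochs = 7
    warmup_ratio=0.1,                    # Warm-up ratio = 0.1
    gradient_accumulation_steps=8,       # Gradient-accumulation = 8
    weight_decay=0.01,                   # Weight-decay = 0.01
)
# With batch size 1 and gradient accumulation of 8, the effective batch
# size is 8 sequences per optimizer step, which lets long merged essays
# fit in memory while still averaging gradients over several examples.
```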
5) F-1 score
To evaluate the model, I will use the f-1 score. To calculate the f-1 score, I use the following formula:

F-1 = TP / (TP + 0.5 × (FP + FN))

Before we can use this formula, we need to find the true positive, false positive, and false negative values as defined by the GSU researchers. As the GSU team describes in their post, each model evaluation contains a ground truth and a prediction. The ground truth is which discourse class the sequence (phrase) belongs to, and the prediction is which class the model thinks the sequence belongs to. If a predicted sequence overlaps the ground truth sequence by 50% or more, it is counted as a true positive. If a predicted sequence has no matching ground truth sequence, I count it as a false positive, and if a ground truth sequence has no matching predicted sequence, I count it as a false negative. Figure 1 shows examples of these predicted sequences and explains in more detail how they are evaluated.
Fig. 1 How TP/FP/FN are calculated.
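To illustrate this evaluation rule, the sketch below implements the 50% overlap matching and the F-1 formula as I understand them from the description above. The span representation, the greedy matching order, and the example data are my own assumptions, not the GSU evaluation code.

```python
# Minimal sketch of the overlap-based matching described above.
# Spans are represented as (discourse_class, start_word, end_word).

def overlap_ratio(pred, truth):
    """Fraction of the ground-truth span covered by the prediction."""
    start = max(pred[1], truth[1])
    end = min(pred[2], truth[2])
    return max(0, end - start) / (truth[2] - truth[1])

def f1_score(predictions, ground_truths, threshold=0.5):
    tp, fp = 0, 0
    unmatched = list(ground_truths)
    for pred in predictions:
        # A prediction is a true positive if it has the right class and
        # covers at least 50% of a still-unmatched ground truth span.
        match = next((t for t in unmatched
                      if t[0] == pred[0] and overlap_ratio(pred, t) >= threshold),
                     None)
        if match is not None:
            tp += 1
            unmatched.remove(match)
        else:
            fp += 1              # unmatched prediction -> false positive
    fn = len(unmatched)          # unmatched ground truth -> false negative
    return tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) else 0.0

# Example: one correct Lead span, one spurious Claim prediction, and one
# missed Claim span -> TP=1, FP=1, FN=1 -> F-1 = 0.5
preds  = [("Lead", 0, 10), ("Claim", 12, 20)]
truths = [("Lead", 0, 12), ("Claim", 30, 40)]
print(f1_score(preds, truths))
```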
III. EXPERIMENTAL RESULTS (AND TECHNICAL CORRECTNESS)
1) Data preprocessing and sequence merging
As you can see from the table "Trained models and their F-1 scores", the BERT model trained without my data preprocessing method has a 17% lower f-1 score (Fig. 2) and 41% lower accuracy (Fig. 3) than the model trained on my preprocessed data. This is because the lead and concluding statements almost always appear at the beginning and end of the essay, respectively. My model was able to leverage positional encoding and learn the relationship between introductory and closing statements and their positions in student essays. My work shows that the best way to train a transformer-based discourse classification architecture is to reassemble the sequences into a full article and let the model use their positional encodings to explore relationships between discourse elements.
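The sketch below illustrates what this sequence merging could look like in code: the labelled sequences of one essay are reassembled into a single document and the labels are propagated to token level, so the position of each discourse element within the essay is visible to the model through its positional encodings. The tokenizer checkpoint, the data layout, and the label handling are my own assumptions for illustration, not the authors' preprocessing script.

```python
# Illustration of the "sequence merging" idea: classify discourse elements
# within the reassembled essay rather than in isolation. Assumes each essay
# is given as an ordered list of (text, label) pairs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def merge_essay(sequences):
    """sequences: list of (text, label) pairs in essay order."""
    full_text, char_spans = "", []
    for text, label in sequences:
        start = len(full_text)
        full_text += text + " "
        char_spans.append((start, start + len(text), label))

    encoding = tokenizer(full_text, return_offsets_mapping=True,
                         truncation=True, max_length=4096)
    token_labels = []
    for tok_start, tok_end in encoding["offset_mapping"]:
        if tok_start == tok_end:          # special tokens get no discourse label
            token_labels.append("O")
            continue
        label = next((lab for s, e, lab in char_spans if s <= tok_start < e), "O")
        token_labels.append(label)
    return encoding, token_labels
```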
Fig. 2: Macro f-1 scores of all models after training

Fig. 3: Accuracy during model training

2) Comparing transformer-based architectures for discourse classification
In my experiments, I fine-tuned three Transformer models from the Hugging Face library: BERT, Longformer, and GPT2. From the "Trained models and their F-1 scores" table, we can see that of all the fine-tuned models, the Longformer model is the best performer, with an f-1 score of 0.535. The BERT model ranks second with an f-1 score of 0.395, and the GPT2 model is the worst with a score of 0.362. My work here shows that the best model for discourse classification is the Longformer model. I believe that the Longformer's ability to handle long inputs without losing important information is why this model has been so successful in my experiments.
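A minimal sketch of how the three compared checkpoints could be loaded for the same fine-tuning setup is shown below. The specific checkpoint names, the token classification head, and the number of discourse classes are my assumptions; the paper does not list them.

```python
# Sketch: loading comparable checkpoints for the three compared
# architectures. Checkpoint names and label count are assumptions.
from transformers import AutoModelForTokenClassification, AutoTokenizer

CHECKPOINTS = {
    "BERT": "bert-base-uncased",                   # 512-token context limit
    "Longformer": "allenai/longformer-base-4096",  # 4096-token context limit
    "GPT2": "gpt2",                                # 1024-token context limit
}
NUM_CLASSES = 7  # assumption: number of discourse categories in the dataset

models = {}
for name, checkpoint in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    if tokenizer.pad_token is None:                # GPT2 has no pad token by default
        tokenizer.pad_token = tokenizer.eos_token
    models[name] = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=NUM_CLASSES)
```

The context-window comments also suggest one reading of the result above: a reassembled essay can exceed BERT's 512-token and GPT2's 1024-token limits and would be truncated for those models, while the Longformer usually sees the whole essay.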
3) High Lead/Concluding Statement scores
All models scored relatively high in the Lead and Concluding Statement categories, and low in the Counterclaim category. As shown in Fig. 4, the average f-1 scores for lead and concluding sentences are 0.751 and 0.587, respectively, the two highest among all categories. This goes against conventional wisdom, since lead and concluding statements are not as common as other categories (such as claims); one would assume that the claim category would score highest because the model has more examples to train on. I believe these results arise because the opening and closing statements are closely tied to their position in the essay. That is, the introductory and concluding sentences almost always appear at the beginning and end of the essay, respectively, which makes it easier for the model to learn these positionally encoded classes. So, my work here shows