
model is important for my research because it uses data preprocessing without sacrificing quality. I believe this model shows how far my preprocessing technique can go.
    4) Hyper-parameters
For the model hyper-parameters I used the following values; a configuration sketch follows the list:
 • Batch-size = 1
 • Learning-rate = 5e-5
 • Epochs = 7
 • Warm-up ratio = 0.1
 • Gradient-accumulation = 8
 • Weight-decay = 0.01
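As a concrete illustration, these settings could be expressed with the Hugging Face `TrainingArguments` API as in the sketch below. The checkpoint name, label count, and output directory are my assumptions for illustration, not details taken from the paper.

```python
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, TrainingArguments)

# Hypothetical checkpoint; BERT, Longformer, and GPT-2 checkpoints from the
# Hugging Face hub could be substituted here. num_labels=7 assumes the seven
# discourse classes in the GSU Feedback Prize data.
checkpoint = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=7)

# The hyper-parameters listed above, expressed as TrainingArguments.
args = TrainingArguments(
    output_dir="discourse-clf",
    per_device_train_batch_size=1,  # Batch-size = 1
    learning_rate=5e-5,             # Learning-rate = 5e-5
    num_train_epochs=7,             # Epochs = 7
    warmup_ratio=0.1,               # Warm-up ratio = 0.1
    gradient_accumulation_steps=8,  # Gradient-accumulation = 8
    weight_decay=0.01,              # Weight-decay = 0.01
)
# These args would then be passed to transformers.Trainer together with
# tokenized train/eval datasets (not shown here).
```

Note that with a batch size of 1 and gradient accumulation of 8, the effective batch size is 8, which keeps memory usage low for long inputs such as full essays.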
    5) F-1 score
    To evaluate the model, I will use the F-1 score, calculated with the following formula:

            F-1 = TP / (TP + 0.5 × (FP + FN))
Before we can use this formula, we need to find the true positive, false positive, and false negative values as defined by the GSU researchers. As the GSU team describes, each model evaluation contains a ground truth and a prediction. The ground truth is the discourse class the sequence (phrase) belongs to, and the prediction is the class the model thinks the sequence belongs to. If a predicted sequence overlaps the ground truth sequence by 50% or more, it is counted as a true positive. If a predicted sequence has no matching ground truth sequence, I count it as a false positive, and if a ground truth sequence has no matching predicted sequence, I count it as a false negative. Figure 1 shows examples of these predicted sequences and explains in more detail how they are counted.

             Fig. 1: How TP/FP/FN are calculated.
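To make the counting rule concrete, the sketch below implements the 50% overlap matching and the F-1 formula above. The (start, end) span representation and the greedy one-to-one matching are my simplifying assumptions, not the exact GSU evaluation code.

```python
def overlap_ratio(pred, truth):
    """Fraction of the ground-truth span covered by the predicted span.
    Spans are (start, end) word indices, end exclusive."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    return inter / (truth[1] - truth[0])

def f1_from_spans(preds, truths, threshold=0.5):
    """Greedy matching: a prediction overlapping a ground-truth span by
    >= 50% is a TP; unmatched predictions are FP, unmatched truths FN."""
    matched = set()
    tp = fp = 0
    for p in preds:
        hit = next((i for i, t in enumerate(truths)
                    if i not in matched and overlap_ratio(p, t) >= threshold),
                   None)
        if hit is None:
            fp += 1
        else:
            matched.add(hit)
            tp += 1
    fn = len(truths) - len(matched)
    # The formula from the text: F-1 = TP / (TP + 0.5 * (FP + FN))
    return tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) else 0.0

# Tiny worked example: two ground-truth spans, two predictions.
truths = [(0, 10), (20, 30)]
preds = [(2, 10), (40, 45)]          # first overlaps 80%, second matches nothing
print(f1_from_spans(preds, truths))  # 1 TP, 1 FP, 1 FN -> 1/(1+0.5*2) = 0.5
```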
      III. EXPERIMENTAL RESULTS (AND TECHNICAL CORRECTNESS)
    1) Data preprocessing and sequence merging
   As you can see from the table "Trained models and their F-1 scores", the BERT model trained without my data preprocessing method has a 17% lower F-1 value (Fig. 2) and a 41% lower accuracy (Fig. 3) than the version trained on my preprocessed data. This is because the lead and concluding statements almost always appear at the beginning and end of the essay, respectively. My model was able to leverage positional encoding and understand the relationship between introductory and closing statements and their positions in student essays. My work shows that the best way to train a transformer-based discourse classification architecture is to reassemble the sequences into a full article and let the model use their positional encodings to explore relationships between discourse elements. A minimal sketch of this sequence-merging step follows.
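The sketch below illustrates what such a merging step could look like: candidate discourse sequences from one essay are concatenated back into a single document, so that token positions reflect where each sequence sits in the essay. Field names such as `essay_id`, `sequence`, and `order` are hypothetical; the paper's actual preprocessing code is not shown.

```python
from collections import defaultdict

def merge_sequences(rows):
    """Reassemble per-sequence rows into full essays.

    Each row is a dict with hypothetical fields:
      essay_id - which essay the sequence came from
      order    - position of the sequence within the essay
      sequence - the text of the discourse sequence
    Returns {essay_id: full_text}; character offsets are recorded per
    sequence so class labels can be mapped back after tokenization.
    """
    essays = defaultdict(list)
    for row in rows:
        essays[row["essay_id"]].append(row)

    merged = {}
    for essay_id, parts in essays.items():
        parts.sort(key=lambda r: r["order"])
        text, offset = "", 0
        for part in parts:
            part["char_start"] = offset          # where this sequence lands
            offset += len(part["sequence"]) + 1  # +1 for the joining space
            part["char_end"] = offset - 1
            text += part["sequence"] + " "
        merged[essay_id] = text.rstrip()
    return merged

rows = [
    {"essay_id": "e1", "order": 1, "sequence": "Therefore, schools should act."},
    {"essay_id": "e1", "order": 0, "sequence": "Many students feel stressed."},
]
print(merge_sequences(rows)["e1"])
# -> "Many students feel stressed. Therefore, schools should act."
```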
       Fig. 2: Macro F-1 scores of all models after training

                Fig. 3: Accuracy during model training

    2) Comparing transformer-based architectures for discourse classification
   In my experiments, I fine-tuned three Transformer models from the Hugging Face library: BERT, Longformer, and GPT-2. From the "Trained models and their F-1 scores" table, we can see that of all the fine-tuned models, the Longformer model, with an F-1 value of 0.535, is the best performer. The BERT model ranks second with an F-1 score of 0.395, and the GPT-2 model is the worst with a value of 0.362. My work here shows that the best model for discourse classification is the Longformer model. I believe that the Longformer's ability to handle large inputs without losing important information is why this model has been so successful in my experiments.

    3) High Lead/Concluding Statement scores
   All models scored relatively high in the Lead and Concluding Statement categories, and low in the Counterclaim category. As shown in Fig. 4, the average F-1 scores for lead and concluding sentences are 0.751 and 0.587, respectively, the two highest among all categories. This goes against conventional wisdom, since leads and concluding statements are not as common as other categories (such as claims); one would assume that the Claim category would score highest because the model has more examples to train on. I believe these results arise because the opening and closing statements are closely tied to their position in the essay. That is, the introductory and concluding sentences almost always appear at the beginning and end of the essay, respectively, which makes it easier for the model to learn these positionally encoded classes. So, my work here shows