
problem as a NER token classification problem, Habiby formulates it as a Q&A problem, which allows him to use a Q&A model. The Transformer model Habiby chose to fine-tune is RoBERTa, a BERT-inspired model from Facebook. Habiby used a maximum sequence length of 448 tokens and a stride of 192 and trained his model for 3 epochs. His overall F-1 score is 0.453.
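To make the maximum length and stride settings concrete, the following is a minimal sketch of sliding-window tokenization with the Hugging Face tokenizers API; the checkpoint name and the placeholder text are illustrative assumptions, not Habiby's actual code.

```python
# Minimal sketch of sliding-window tokenization with a maximum length of 448
# tokens and a stride of 192, as described above. The checkpoint and the text
# are placeholders, not Habiby's actual pipeline.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

essay_text = "..."  # a full student essay would go here

encoding = tokenizer(
    essay_text,
    max_length=448,                  # each window holds at most 448 tokens
    stride=192,                      # consecutive windows overlap by 192 tokens
    truncation=True,
    return_overflowing_tokens=True,  # emit every window, not just the first
    padding="max_length",
)

# Each entry in encoding["input_ids"] is one overlapping window of the essay.
print(len(encoding["input_ids"]), "windows")
```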
Roman et al. [12] used several machine learning techniques in their approach to the problem of classifying discourse elements. The first technique they used was weighted box fusion, which combines the outputs of 10 different models into a single decision. Most of the models used are variants of the DeBERTa model and the Longformer model. After obtaining the model results, the team applied post-processing, such as fixing span predictions and applying discourse-type-specific rules, to clean up the models' output. Their overall F-1 score is 0.74, and the models were trained for 5 epochs on Nvidia V100 32 GB and A100 40 GB GPUs.
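Weighted box fusion itself is more involved, but the core idea of merging several models' predictions into one decision can be illustrated with a simple weighted average of class probabilities; the weights and arrays below are made-up placeholders, not Roman et al.'s actual fusion code.

```python
# Simplified illustration of combining several models' outputs into one
# decision: plain weighted probability averaging, not the full weighted box
# fusion procedure used by Roman et al. All numbers are made up.
import numpy as np

# Per-token class probabilities from three hypothetical models,
# each of shape (num_tokens, num_classes).
model_outputs = [
    np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]),
    np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]),
    np.array([[0.8, 0.1, 0.1], [0.3, 0.5, 0.2]]),
]
weights = np.array([0.5, 0.3, 0.2])  # hypothetical per-model weights

# Weighted average of the probability distributions, then an argmax decision.
fused = sum(w * p for w, p in zip(weights, model_outputs))
predictions = fused.argmax(axis=-1)
print(predictions)  # one class index per token
```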
In this project, machine learning researcher Ali Habiby [13] used a random forest model instead of his previous Q&A model to solve the discourse element classification problem. One advantage of this model is that it is easy to understand and replicate. The train/test split Habiby chose for this model is 70% train and 30% test, and the model has an overall F-1 value of 0.25. While this model is easy to replicate and understand, I think it is too simplistic, given its low F-1 value, to capture how the different discourse elements relate to each other.
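For reference, a random forest baseline of this kind can be set up in a few lines; the TF-IDF features, toy examples, and hyperparameters below are assumptions for illustration, not Habiby's actual configuration.

```python
# Minimal sketch of a random forest baseline with a 70/30 train/test split.
# The TF-IDF features, toy examples, and hyperparameters are assumptions for
# illustration, not Habiby's actual configuration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

texts = [
    "In my opinion, school should start later in the day.",
    "Some students disagree with this position.",
    "For example, later start times improve attendance.",
    "In conclusion, schools should adopt a later schedule.",
]
labels = ["Position", "Counterclaim", "Evidence", "Concluding Statement"]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```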
Lonnie [14] uses the Keras library to create an LSTM network that can classify discourse elements in student papers. One notable layer included in the Lonnie model is a padding layer of length 1024. This is important because most other solutions are fine-tuned versions of the BERT model, and the BERT model can only hold 512 tokens at a time. Lonnie's model is therefore better able to accommodate larger student papers than most other solutions, but Lonnie still trains on one sequence of data at a time, which I think prevents his model from reaching its full potential. Overall, the F-1 value of the Lonnie model is 0.214.
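A minimal Keras sketch of an LSTM classifier whose inputs are padded to length 1024 looks like the following; the vocabulary size, layer widths, and toy sequences are placeholders rather than Lonnie's exact architecture.

```python
# Minimal sketch of a Keras LSTM classifier with inputs padded to length 1024.
# Vocabulary size, layer widths, and the toy sequences are placeholders,
# not Lonnie's exact architecture.
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

NUM_CLASSES = 7      # one class per discourse element type
VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 1024       # every sequence is padded/truncated to 1024 tokens

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy integer-encoded sequences of different lengths, padded out to MAX_LEN
# so that essays longer than BERT's 512-token limit still fit in one input.
sequences = [[4, 17, 9], [12, 3, 3, 8, 1]]
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
print(padded.shape)  # (2, 1024)
```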
Drakuttala [15], a machine learning researcher, fine-tuned the RoBERTa base model to address the discourse element classification problem. One thing that stands out about Drakuttala's method is that he clearly defined each element during model training. Instead of using 7 classes like most other researchers, he used Claim, Position, Lead, and Counterclaim. Drakuttala also organized the data into two parts, B and I: class B marks the word that begins an entity, while class I, as its name implies, is for words inside an entity. Rather than a single Lead class, for example, he created two Lead classes, B-Lead and I-Lead. Drakuttala achieved a 0.54 F-1 score during training over 3 epochs with a 1e-5 learning rate and a 512 token length.
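To make this B-/I- scheme concrete, here is a minimal sketch that assigns B- and I- tags to the words of a labelled span; the whitespace tokenization and the example span are simplifications, not Drakuttala's actual labelling code.

```python
# Minimal sketch of the B-/I- labelling idea: the first word of a labelled
# span gets a B- tag and every following word gets an I- tag. Whitespace
# tokenization and the example span are simplifications.
def bio_tags(span_text: str, discourse_type: str) -> list[tuple[str, str]]:
    words = span_text.split()
    tags = [f"B-{discourse_type}"] + [f"I-{discourse_type}"] * (len(words) - 1)
    return list(zip(words, tags))

print(bio_tags("Dear senator, I am writing to you today", "Lead"))
# [('Dear', 'B-Lead'), ('senator,', 'I-Lead'), ('I', 'I-Lead'), ...]
```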

      II. APPROACH (AND TECHNICAL CORRECTNESS)

    1) PERSUADE corpus
      The training and testing data used to fine-tune my models is the PERSUADE corpus, a dataset created by the Learning Agency Lab. I chose this dataset because it is specifically designed for the problem of discourse classification. The corpus contains over 25,000 student papers, all annotated by writing professionals [16]. To ensure that the dataset is as accurate as possible, each article is annotated using a double-blind scoring procedure and reviewed by another, third-party writing professional [17]. The content of this dataset is well suited for training and testing models; however, I believe the format of the dataset can be improved through data preprocessing.

    2) Data preprocessing
      To preprocess the data for this model, I decided to reassemble the individual sentence sequences into a joint article. In the PERSUADE corpus, articles are divided into sequences, each sequence representing a different discourse element. I believe this is not the best format for transformer-based models, because these models use positional encoding. Positional encoding is a technique added to the Transformer architecture because the model has no recurrence, which means that without it the sequences "Hello World" and "World Hello" look the same to the model [18]. By adding positional encoding to the word embeddings, the Transformer model can learn that different word positions in the text carry different meanings, and I believe this property can be exploited for discourse classification. Certain discourse elements, such as closing sentences, are highly correlated with their position in the text; merging the sequences before fine-tuning therefore gives the model a chance to learn how the position of a sequence in the paper relates to its discourse type.
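As a concrete illustration, the reassembly step might look like the following pandas sketch; the file name and the column names (essay_id, discourse_type, discourse_text) are assumptions about the corpus layout rather than the exact released schema.

```python
# Minimal sketch of the preprocessing step described above: regrouping the
# per-element sequences of each essay back into one joint article. The file
# name and column names (essay_id, discourse_type, discourse_text) are
# assumptions about the corpus layout, not guaranteed to match it exactly.
import pandas as pd

rows = pd.read_csv("persuade_corpus.csv")

essays = (
    rows.groupby("essay_id")["discourse_text"]
        .apply(" ".join)              # concatenate the sequences in row order
        .rename("full_text")
        .reset_index()
)

# Keep the per-sequence labels alongside the merged article so the model can
# still be trained to predict a discourse type for each sequence.
labels = rows.groupby("essay_id")["discourse_type"].apply(list).reset_index()
dataset = essays.merge(labels, on="essay_id")
print(dataset.head())
```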
    3) Three different models (BERT, Longformer, and GPT-2)
      The three models chosen for fine-tuning in this paper are BERT, Longformer, and GPT-2. I decided to fine-tune several models because I wanted to see how different model architectures address the problem of discourse classification. I was also interested in whether different models are better at classifying different elements of discourse. I chose the BERT model because it is one of the most popular models for NLP tasks. According to the Hugging Face model hub, the BERT model was downloaded 15.8 million times by researchers in April 2022, making it the second most popular NLP model [19]. I chose to include this model in my own study so that my results could be compared with those of other researchers. Another model that I fine-tune is GPT-2. This model is also popular, but I included it in the project mainly because of its design. Unlike the BERT model, which stacks the encoder layers of the Transformer, the GPT-2 architecture stacks the decoder layers [20]. In this paper, I want to see whether this design difference affects discourse classification results. The last model I fine-tune, and the one I think is the most promising, is the Longformer model. The Longformer model is an extension of the BERT model designed to handle longer inputs without compromising quality [21]. This feature is important for my research because my data preprocessing produces long inputs, and most models lose information from the beginning of the sequence. The Longformer
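For reference, a minimal sketch of loading the three models with the Hugging Face transformers library is shown below; the checkpoint names and the choice of a sequence-classification head with 7 discourse labels are my assumptions about the setup, not the paper's exact training code.

```python
# Minimal sketch of loading BERT, Longformer, and GPT-2 for fine-tuning with
# the Hugging Face transformers library. The checkpoints and the
# sequence-classification head (one discourse label per sequence) are
# assumptions about the setup, not the paper's exact training code.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 7  # one label per discourse element type
checkpoints = {
    "bert": "bert-base-uncased",
    "longformer": "allenai/longformer-base-4096",  # accepts up to 4096 tokens
    "gpt2": "gpt2",
}

models, tokenizers = {}, {}
for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt, num_labels=NUM_LABELS
    )
    if tokenizer.pad_token is None:          # GPT-2 ships without a pad token
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.eos_token_id
    tokenizers[name], models[name] = tokenizer, model
```

Fine-tuning each of these checkpoints with the same training loop then isolates the effect of the architecture itself on the comparison.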