Ullah Ihsan1, Jamil Anum2, Hassan Imtiaz Ul3, Kim Byung-Seo1,*
               
- (Department of Software and Communication Engineering, Hongik University, Korea; danish1852@gmail.com, jsnbs@hongik.ac.kr)
- (Department of Physics, NED University of Engineering & Technology, Karachi, Pakistan; jamilanum47@gmail.com)
- (Department of Computer Science and Information Technology, NED University of Engineering & Technology, Karachi, Pakistan; zahidbooni1@gmail.com)
                        
 
            
            
            Copyright © The Institute of Electronics and Information Engineers(IEIE)
            
            
            
            
            
               
                  
Keywords
               
                Text mining,  Text classification,  Sentiment analysis,  Supervised machine learning,  BERT,  GRU,  LSTM
             
            
          
         
            
                  1. Introduction
               Natural disasters have become frequent worldwide, causing significant destruction
                  and loss of life. With the rise of social media platforms, particularly Twitter, people
                  now have an easy and immediate way of sharing information about these disasters. Twitter's
                  real-time nature enables people to post updates and emergency information about disasters
                  as they occur. The information shared on Twitter can benefit first responders and
                  disaster relief organizations because they can quickly assess the situation and allocate
                  resources accordingly. Studies have shown that social media platforms, such as Twitter,
                  can provide critical information to help manage natural disasters. One study examined
                  the role of Twitter in disseminating information during Hurricane Harvey in 2017 [1]. They reported that Twitter was a valuable tool for sharing situational updates and
                  emergency information, especially in the early stages of a disaster when the traditional
                  sources of information were limited. Another study analyzed the tweets during the
                  2017 Mexico earthquake and reported that Twitter users effectively shared information
                  about missing persons and relief efforts [2].
               
               On the other hand, the vast amount of unstructured and noisy data on Twitter poses
                  challenges for effective disaster response and management. Various Natural Language
                  Processing (NLP) techniques have been employed to classify and analyze disaster-related
                  tweets to address these challenges [3]. These techniques automatically categorize tweets into different types, such as informative,
                  supportive, and observational, to enable efficient filtering and analysis of disaster-related
                  information. 
               
               Several studies have demonstrated the effectiveness of NLP techniques in classifying
                  disaster-related tweets. One study used a deep learning approach to
                  classify tweets related to the California wildfires [4]. Such studies showed promising results in tweet classification tasks using deep
                  learning models, such as recurrent neural networks (RNNs) and transformers. For example,
                  a bi-directional long short-term memory (LSTM) model with an attention mechanism was used
                  to classify tweets related to natural disasters into four categories: casualty,
                  damage, donations, and sentiment. The model achieved an accuracy of 89.7% and outperformed
                  several baseline models. Similarly, a pre-trained bidirectional encoder representations from transformers (BERT)
                  model was used to classify tweets related to the COVID-19 pandemic into four categories: news,
                  opinions, advisories, and miscellaneous [5].
               
               The present study compared the performance of three different NLP models, namely BERT,
                  gated recurrent units (GRU), and LSTM, for tweet classification of disaster data.
                  The study contributes to the field of crisis informatics,
                  particularly in the use of NLP models for disaster detection
                  and response. The specific contributions of this research can be summarized in the
                  following points.
               
               1. This paper presents a unique study comparing three distinct NLP models on disaster-related
                  tweet classification, a topic previously unexplored. A dataset of 5545 tweets was
                  manually annotated to assess the strengths and limitations of each model and guide
                  future research in this domain.
               
               2. This work introduces a robust framework for extracting disaster-relevant information
                  from Twitter, aiming to enhance the efficiency and depth of disaster management strategies
                  by interpreting social media data more effectively.
               
               3. This study aimed to develop a mechanism for identifying and categorizing disaster-related
                  tweets to sift through the vast Twitter data. The goal is to provide real-time updates
                  and emergency information during natural disasters, enabling stakeholders to gain
                  immediate insights and respond quickly and effectively.
               
               This research aims to demonstrate the potential of sophisticated NLP techniques in
                  aiding disaster response and management.
               
             
            
                  2. Related Work
               Natural disasters, such as earthquakes, floods, accidents, and hurricanes, have significant
                  social and economic impacts on the affected communities. Social media platforms, such
                  as Twitter, have emerged as valuable sources of real-time information during disasters
                  [6]. Twitter users often share first-hand accounts, photographs, and videos of the disasters,
                  as well as requests for help, information, and donations [7]. On the other hand, the vast amount of unstructured and noisy data on Twitter poses
                  challenges to effective disaster response and management.
               
               In recent years, there has been a growing interest in leveraging social media for
                  disaster management and response. A previous study developed a system to utilize Twitter
                  data for coordination in disaster response scenarios [8]. Their study focused on clustering tweets and categorizing them based on their relevance
                  to disasters. They demonstrated how social media can serve as a real-time source of
                  disaster-related information.
               
               Tweet classification, however, has proven to be a challenge because
                  of the short, noisy, and unstructured nature of the text. Studies have looked into this
                  problem, examining the use of convolutional neural networks (CNNs) for text classification
                  [9]. They showed that CNNs can effectively handle the short and sparse nature of tweets,
                  paving the way for further exploration of deep learning techniques in this context.
               
               Research has also shown that the effectiveness of different deep learning models
                  depends heavily on the task at hand. For example, one study examined several models,
                  including LSTM and GRU, for sentiment analysis [10] and found that GRU models generally
                  outperformed their LSTM counterparts. This difference may be attributed to the structural
                  characteristics of GRUs, including their simplified gating mechanism and lower computational
                  complexity, which may be advantageous in specific scenarios, such as sentiment analysis.
                  Such performance disparities underline the importance of choosing a deep learning model
                  based on the requirements and nature of the task, and they call for further task-specific
                  research and benchmarking to identify the best model for each context.
               
               The present study compared the performance of three NLP models, namely BERT, GRU,
                  and LSTM, for tweet classification of disaster data. To the best of the authors’ knowledge,
                  no study has compared the performance of these models on disaster-related tweets.
               
             
            
                  3. The Proposed Scheme
               This section describes the dataset used for disaster tweet classification, including
                  data collection and preprocessing information. The section also presents the proposed
                  methodology for the classification task on the collected dataset.
               
               
                     3.1 Data Collection 
                  The Tweepy library, a popular Python package, was used for data collection. Tweepy
                     provides a convenient interface for accessing the Twitter API. With
                     Tweepy, the authors authenticated and established a connection with the Twitter
                     platform and retrieved tweets based on specific search queries and hashtags
                     related to disasters. This study used hashtags, such as #Disaster, #Earthquake,
                     #Floods, #Accidents, and #Disasters, to collect disaster-related tweets for
                     further analysis and classification. The library streamlined the data collection
                     process and ensured the inclusion of relevant tweets on different types of disasters.
                     A dataset of 5545 tweets was collected using Tweepy, providing a diverse
                     dataset for analysis.
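
                  The collection step can be sketched as follows. This is only an illustration, not the authors' script: the paper does not state which Twitter API version or endpoint was used, so the use of `tweepy.Client` with the recent-search endpoint, the helper names `build_query` and `fetch_tweets`, and the `bearer_token` credential are all assumptions.

```python
def build_query(hashtags):
    """Join hashtags into one OR-query, e.g. '(#Disaster OR #Earthquake)'."""
    tags = [h if h.startswith("#") else "#" + h for h in hashtags]
    return "(" + " OR ".join(tags) + ")"


def fetch_tweets(bearer_token, hashtags, limit=100):
    """Return up to `limit` tweet texts matching any of the hashtags.

    Assumes Tweepy v4 and a valid bearer token (hypothetical credential).
    """
    import tweepy  # imported lazily so build_query works without Tweepy

    client = tweepy.Client(bearer_token=bearer_token)
    pages = tweepy.Paginator(
        client.search_recent_tweets,
        query=build_query(hashtags),
        max_results=100,  # page size per request
    )
    return [tweet.text for tweet in pages.flatten(limit=limit)]


# Hashtags listed in the text
HASHTAGS = ["Disaster", "Earthquake", "Floods", "Accidents", "Disasters"]
print(build_query(HASHTAGS))
```

                  Calling `fetch_tweets` requires network access and valid API credentials; only the query construction runs offline.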
                  
                
               
                     3.2 Data Annotation 
                  The 5545 collected tweets were manually annotated into different disaster categories,
                     namely 'Earthquake', 'Flood', 'Accident', and 'Other disaster'. The annotation
                     process involved carefully analyzing the content of each tweet and assigning it to the appropriate
                     category based on its context and keywords. This manual annotation was carried out
                     by a team of trained annotators who followed a predefined set of guidelines to
                     ensure consistency and accuracy in the categorization.
                  
                
               
                     3.3 Data Preprocessing 
                  In the data preprocessing phase, several steps were undertaken to prepare the collected
                     tweets for further analysis. Initially, common words that do not carry significant
                     meaning, such as stop words, were removed from the text. This step helped reduce noise
                     and improve the efficiency of subsequent processes. Furthermore, a technique called
                     lemmatization was applied to transform words into their base or root form, consolidating
                     the variations of the same word. This step enhanced the accuracy of the classification
                     task by reducing the dimensionality of the data and capturing the essence of the tweet
                     content. The dataset was refined and optimized for subsequent analysis and classification
                     tasks by performing these preprocessing steps.
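
                  A minimal sketch of the two preprocessing steps is given below. The paper does not name the stop-word list or the lemmatizer that was used, so the tiny `STOP_WORDS` set and the `LEMMAS` lookup table are illustrative stand-ins for full resources (for example, those shipped with an NLP toolkit).

```python
import re

# Tiny illustrative resources (assumptions); a real pipeline would use full lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "in", "on", "of", "and", "to"}
LEMMAS = {"floods": "flood", "flooded": "flood", "earthquakes": "earthquake",
          "houses": "house", "destroyed": "destroy", "roads": "road"}


def preprocess(tweet):
    """Lowercase, tokenize, drop stop words, and map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


print(preprocess("The floods destroyed houses and roads in the city"))
```

                  Mapping "floods" and "flooded" to the same lemma is what reduces the vocabulary size, and therefore the dimensionality of the input.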
                  
                
               
                     3.4 Data Visualization 
                  Data visualization is a powerful tool that provides insights, identifies patterns,
                     and communicates complex information effectively through visual representations. A
                     count plot was produced to visualize the distribution of different disaster types
                     in the dataset (Fig. 1). The data showed that during the collection period, the highest number of incidents
                     recorded was related to earthquakes, with a count of 2065. This was followed by other
                     disasters with 1348 occurrences, floods with 1215 occurrences, and accidents with
                     917 occurrences. The higher count of earthquake incidents can be attributed to the
                     data being collected when a significant earthquake event occurred in Turkey.
                  
                  Another type of visualization that can be performed on textual data is a word cloud.
                     The word cloud produced from a dataset of disaster tweets revealed the prominent terms
                     associated with different types of disasters. The most frequent terms in the word
                     cloud, which can be observed in Fig. 2, include "Earthquake," "Hurricane," "Accident," and "Flood Warning." These terms
                     indicate the prevalence of these specific disaster types in the dataset and highlight
                     the significance of these events in the context of the analyzed tweets. The word cloud
                     provides a visual representation that quickly identifies the most commonly mentioned
                     disaster types in the dataset.
                  
                  
                        Fig. 1. Tweet counts of each class.
 
                  
                        Fig. 2. Word Cloud for all the tweets.
 
                
               
                     3.5 Data Transformation 
                  This section describes the data preparation steps, namely label encoding, tokenization,
                     text-to-sequence conversion, and padding, that prepare the text data for disaster tweet
                     classification.
                  
                  Label encoding is a technique used to convert categorical labels into numerical values.
                     In disaster tweet classification, the labels ['Accident', 'Earthquake', 'Flood', and
                     'Other disaster'] were assigned the corresponding numerical labels 0, 1, 2, and 3,
                     respectively, using the scikit-learn label encoder. This allows the machine learning
                     model to understand and process the labels effectively.
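
                  The mapping above follows from the fact that the scikit-learn label encoder assigns integer codes in the sorted order of the class names. The pure-Python sketch below mimics that behavior without the library; the function name `fit_label_encoder` and the sample labels are hypothetical.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    classes = sorted(set(labels))
    return {name: code for code, name in enumerate(classes)}


labels = ["Flood", "Earthquake", "Other disaster", "Accident", "Earthquake"]
mapping = fit_label_encoder(labels)
encoded = [mapping[name] for name in labels]
print(mapping)
print(encoded)
```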
                  
                  Tokenization is the process of splitting text into individual words or tokens.
                     In this case, the vocabulary size was determined to be 14509, meaning there are 14509
                     unique words in the given disaster tweet dataset. Tokenization is an essential step
                     in natural language processing tasks because it allows the model to understand and
                     analyze the text data at a granular level.
                  
                  After tokenization, the next step involved converting the text data into sequences.
                     This conversion is necessary to represent each word in the text as a numerical sequence
                     that machine learning models can process. Each unique word in the vocabulary is assigned
                     a unique integer value. The conversion of text into sequences allows the model to
                     understand and analyze the text data numerically.
                  
                  After the conversion, the sequence length was fixed at 27 words, the length of the
                     longest tweet. Shorter sequences were padded with zeros at the end (post-padding)
                     to ensure that all sequences have the same length. This uniform length is
                     needed for training machine learning models that require fixed-length input sequences.
                     After this step, the text data is ready to be fed into a model
                     for disaster tweet classification.
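
                  The tokenization, sequence conversion, and post-padding steps can be sketched in plain Python as follows. The paper does not name the tokenizer it used, so whitespace splitting, the frequency-ordered ids, and the helper names are assumptions; only the maximum length of 27 and the use of zero as the padding value come from the text.

```python
from collections import Counter

MAX_LEN = 27  # maximum tweet length reported in the text


def build_vocab(texts):
    """Assign integer ids starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequence(text, vocab, max_len=MAX_LEN):
    """Convert text to word ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


texts = ["flood warning issued", "earthquake hits city", "flood hits city"]
vocab = build_vocab(texts)
seq = to_padded_sequence("flood hits city", vocab)
print(len(vocab), len(seq), seq[:5])
```

                  Reserving id 0 for padding keeps padded positions distinguishable from real words.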
                  
                
               
                     3.6 Data Splitting: Training, Validation, and Testing Sets
                  Data splitting is a crucial step in machine learning, where the available dataset
                     is divided into separate subsets, such as training, validation, and testing sets,
                     to facilitate model development, evaluation, and optimization. In the case of 5545
                     tweets, the data was divided into 15% for testing (831 tweets), 10% for validation
                     (554 tweets), and 75% for training (4160 tweets).
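
                  The subset sizes can be reproduced with a short sketch. The shuffling procedure and the seed are assumptions, since the paper reports only the proportions and the resulting counts.

```python
import random


def split_indices(n, test_frac=0.15, val_frac=0.10, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    n_test = int(n * test_frac)  # fractional counts are rounded down
    n_val = int(n * val_frac)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # remainder goes to training
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(5545)
print(len(train_idx), len(val_idx), len(test_idx))  # 4160 554 831
```

                  Rounding the test and validation counts down and assigning the remainder to training reproduces the reported 4160/554/831 split.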
                  
                
               
                     3.7 Model Selection 
                  Once the data had been prepared, deep learning models were trained on
                     it to perform disaster tweet classification. Three different models were
                     trained and compared: GRU, LSTM, and BERT.
                  
                  
                        3.7.1 GRU (Gated Recurrent Unit)
                     The GRU introduced by Cho et al. [11] is a type of RNN that has gained popularity in deep learning. It is designed to address
                        the vanishing gradient problem in traditional RNNs. The GRU includes two key gates:
                        update and reset gates. The update gate determines how much of the previous hidden
                        state should be passed on to the current time step, while the reset gate controls
                        how much of the previous hidden state should be ignored. These gates play a crucial
                        role in governing the flow of information in the GRU, allowing it to capture long-term
                        dependencies in sequential data.
                     
                     Eq. (1) describes the update gate, a key component of the GRU architecture.
                     
                     $$z_{t}=\sigma \left(W_{z}x_{t}+U_{z}h_{t-1}\right) \tag{1}$$
                     
                     where z$_{\mathrm{t}}$ is the update gate activation at time step t, ${\sigma}$
                        denotes the sigmoid activation function, and W$_{\mathrm{z}}$ and U$_{\mathrm{z}}$ are
                        the weight matrices that control the influence of the current input x$_{\mathrm{t}}$
                        and the previous hidden state h$_{\mathrm{t-1}}$, respectively.
                     
                     Eq. (2) describes the reset gate.
                     
                     $$r_{t}=\sigma \left(W_{r}x_{t}+U_{r}h_{t-1}\right) \tag{2}$$
                     
                     where r$_{\mathrm{t}}$ is the reset gate activation at time step t, ${\sigma}$ is the
                        sigmoid activation function, and W$_{\mathrm{r}}$
                        and U$_{\mathrm{r}}$ are the weight matrices determining the impact of the current
                        input x$_{\mathrm{t}}$ and the previous hidden state h$_{\mathrm{t-1}}$ on the reset
                        gate activation.
                     
                     These equations and subsequent calculations help the GRU model decide how much information
                        to retain from the previous time step and how much to update with new inputs, enabling
                        it to capture and process sequential dependencies effectively.
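
                     As a rough illustration of Eqs. (1) and (2), the NumPy sketch below computes both gates for a single time step. It assumes the bias-free form implied by the definitions above (a weight matrix for the input and one for the previous hidden state, passed through a sigmoid); the toy dimensions and random weights are arbitrary.

```python
import numpy as np


def sigmoid(a):
    """Logistic function; squashes each element into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def gru_gates(x_t, h_prev, W_z, U_z, W_r, U_r):
    """Return the update gate z_t and reset gate r_t for one time step."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)  # Eq. (1)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)  # Eq. (2)
    return z_t, r_t


rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # toy sizes chosen for illustration
x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)
W_z, W_r = rng.normal(size=(2, hidden_dim, input_dim))
U_z, U_r = rng.normal(size=(2, hidden_dim, hidden_dim))

z_t, r_t = gru_gates(x_t, h_prev, W_z, U_z, W_r, U_r)
print(z_t.shape, r_t.shape)
```

                     Each gate has one value per hidden unit, and every value lies strictly between 0 and 1, so it can act as a soft switch on the corresponding hidden-state component.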
                     
                     The GRU-based model begins with an embedding layer that converts integer-encoded
                        words into dense vectors using the given vocabulary size and embedding dimensions.
                        This is followed by a bidirectional GRU layer with 256 units and a ReLU activation,
                        which processes the sequence in both directions. A GlobalAveragePooling1D layer then
                        summarizes this temporal information. A dense layer with 64 neurons and ReLU activation
                        is then used, followed by a dropout layer with a rate of 0.4 to mitigate overfitting. The architecture
                        ends with a dense layer with four neurons and a softmax activation, which outputs
                        the probabilities of the four disaster classes.
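
                     A hedged Keras sketch of this architecture is shown below. The layer types, unit counts, activations, and dropout rate follow the description above, while several details are assumptions because the text does not state them: the embedding dimension (`EMBED_DIM = 100`), `return_sequences=True` (needed so that the pooling layer receives one vector per time step), and the `+ 1` on the vocabulary size to leave room for the padding id 0.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 14509  # vocabulary size reported in the text
MAX_LEN = 27        # padded tweet length reported in the text
EMBED_DIM = 100     # assumption: the embedding dimension is not stated
NUM_CLASSES = 4

model = models.Sequential([
    # +1 leaves room for the padding id 0 (assumption)
    layers.Embedding(input_dim=VOCAB_SIZE + 1, output_dim=EMBED_DIM),
    layers.Bidirectional(
        layers.GRU(256, activation="relu", return_sequences=True)
    ),
    layers.GlobalAveragePooling1D(),  # average over the time steps
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Forward pass on a dummy batch of two padded sequences
dummy_batch = np.zeros((2, MAX_LEN), dtype="int32")
probs = model(dummy_batch).numpy()
print(probs.shape)
```

                     Replacing `layers.GRU` with `layers.LSTM` yields the corresponding LSTM-based variant described in Section 3.7.2.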
                     
                   
                  
                        3.7.2 LSTM (Long Short-term Memory)
                     LSTM is a well-established RNN architecture that effectively addresses the vanishing
                        gradient problem, a common issue in training traditional RNNs. The model achieves
                        this by introducing memory cells and three essential gating mechanisms: the input
                        gate, forget gate, and output gate. These gates play a critical role in regulating
                        the flow of information through the network, enabling the LSTM to capture and retain
                        long-range dependencies in the input sequence. LSTM has widespread applications in
                        various tasks, including speech recognition, language modeling, and text classification,
                        owing to its robustness in modeling sequential data. Its exceptional ability to capture
                        long-term dependencies makes it particularly well-suited for understanding the context
                        and semantics of the text, which is essential for accurate classification, such as
                        in disaster-related tweets.
                     
                     Input Gate (i$_{\mathrm{t}}$): The input gate controls
                        how much the current input (x$_{\mathrm{t}}$) should be used to update the cell state
                        (C$_{\mathrm{t}}$). It is calculated using the sigmoid activation function, as given in Eq. (3).
                     
                     $$i_{t}=\sigma \left(W_{i}\left[h_{t-1},x_{t}\right]+b_{i}\right) \tag{3}$$
                     
                     where W$_{\mathrm{i}}$ is the weight matrix for the input gate, [h$_{\mathrm{t-1}}$,
                        x$_{\mathrm{t}}$] represents the concatenation of the previous hidden state and the
                        current input, and b$_{\mathrm{i}}$ is the bias vector for the input gate. ${\sigma}$ is
                        the sigmoid activation function, which scales the output between 0 and 1.
                     
                     Forget Gate (f$_{\mathrm{t}}$): The forget gate determines the extent to which the
                        previous cell state (C$_{\mathrm{t-1}}$) should be forgotten when processing the current
                        input (x$_{\mathrm{t}}$) and the previous hidden state (h$_{\mathrm{t-1}}$). The gate
                        is also calculated using the sigmoid activation function, as given in Eq. (4).
                     
                     $$f_{t}=\sigma \left(W_{f}\left[h_{t-1},x_{t}\right]+b_{f}\right) \tag{4}$$
                     
                     where W$_{\mathrm{f}}$ is the weight matrix for the forget gate, [h$_{\mathrm{t-1}}$,
                        x$_{\mathrm{t}}$] represents the concatenation of the previous hidden state and the
                        current input, b$_{\mathrm{f}}$ is the bias vector for the forget gate, and ${\sigma}$
                        is the sigmoid activation function.
                     
                     Output Gate (o$_{\mathrm{t}}$): The output gate controls the extent to which the current
                        cell state (C$_{\mathrm{t}}$) should influence the computation of the current hidden
                        state (h$_{\mathrm{t}}$). The gate is calculated using the sigmoid activation function, as
                        given in Eq. (5).
                     
                     $$o_{t}=\sigma \left(W_{o}\left[h_{t-1},x_{t}\right]+b_{o}\right) \tag{5}$$
                     
                     where W$_{\mathrm{o}}$ is the weight matrix for the output gate, [h$_{\mathrm{t-1}}$, x$_{\mathrm{t}}$]
                        represents the concatenation of the previous hidden state and the current input, b$_{\mathrm{o}}$
                        is the bias vector for the output gate, and ${\sigma}$ is the sigmoid activation function.
                     
                     These gating mechanisms in LSTM and the memory cell enable the network to update and
                        forget information selectively, allowing it to learn long-term dependencies and effectively
                        model sequential data. The ability to capture complex patterns and context in the
                        input sequence makes LSTM a powerful tool for various natural language processing
                        tasks, including disaster-related tweet classification, where an accurate understanding
                        of the text's semantics is crucial.
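
                     The three gates described in Eqs. (3)-(5) share the same form: a sigmoid applied to a linear map of the concatenated vector [h$_{\mathrm{t-1}}$, x$_{\mathrm{t}}$]. The NumPy sketch below illustrates this for one time step. The toy dimensions, random weights, and zero biases are arbitrary, and the cell-state update itself is not covered by this sketch.

```python
import numpy as np


def sigmoid(a):
    """Logistic function; squashes each element into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def lstm_gates(x_t, h_prev, W_i, b_i, W_f, b_f, W_o, b_o):
    """Return the input, forget, and output gate activations for one time step."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)  # Eq. (3)
    f_t = sigmoid(W_f @ concat + b_f)  # Eq. (4)
    o_t = sigmoid(W_o @ concat + b_o)  # Eq. (5)
    return i_t, f_t, o_t


rng = np.random.default_rng(1)
input_dim, hidden_dim = 5, 3  # toy sizes chosen for illustration
x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)
W_i, W_f, W_o = rng.normal(size=(3, hidden_dim, hidden_dim + input_dim))
b_i = b_f = b_o = np.zeros(hidden_dim)

i_t, f_t, o_t = lstm_gates(x_t, h_prev, W_i, b_i, W_f, b_f, W_o, b_o)
print(i_t.shape, f_t.shape, o_t.shape)
```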
                     
                     The architecture of the LSTM-based model begins with an embedding layer, which transforms
                        integer-encoded words into dense vectors. This feeds into a bidirectional LSTM with
                        256 units and a ReLU activation, which processes the sequence in both directions.
                        A GlobalAveragePooling1D layer then summarizes this temporal information, leading to a dense
                        layer with 64 neurons and ReLU activation. A dropout
                        layer with a rate of 0.4 was employed to prevent overfitting. Finally,
                        a softmax-activated dense layer outputs the class probabilities for the
                        disaster-related tweets.
                     
                   
                  
                        3.7.3 BERT (Bidirectional Encoder Representations from Transformers)
                     BERT is a transformer-based model that has revolutionized natural language processing
                        tasks. The BERT model follows a two-step framework: pre-training and fine-tuning [12]. In the pretraining phase, the model undergoes training on a vast unlabeled corpus.
                        For the fine-tuning stage, the model starts with pre-trained parameters, which are
                        then fine-tuned using labeled data specific to the tasks.
                     
                     A key property of BERT is its bidirectional conditioning. A standard LSTM or GRU layer
                        reads the input in one direction, and a bidirectional variant combines two separate
                        directional passes. In contrast, the self-attention in BERT considers the left and right
                        contexts of each word jointly in every layer, providing a more comprehensive understanding
                        of the context. This bidirectional nature allows BERT to capture long-range dependencies
                        efficiently. In addition, BERT differs from LSTM and GRU regarding the training objectives.
                        BERT uses unsupervised pretraining, learning from large amounts of unannotated text
                        data through masked language modeling and next-sentence prediction. In contrast, LSTM
                        and GRU typically undergo supervised training with labeled data.
                     
                     BERT uses a multi-layer bidirectional transformer encoder [12]. The encoder of the original
                        Transformer consists of N = 6 identical layers, whereas BERT-base stacks 12 such layers.
                        Each layer has two sub-layers. In the first
                        sub-layer, a multi-head self-attention mechanism captures the relationships between
                        different words in the input sequence, allowing the model to comprehend the context
                        effectively. The second sub-layer uses a position-wise fully connected feedforward
                        network to process the output of the self-attention layer further. The scaled dot-product
                        attention mechanism, given in Eq. (6), is a fundamental building block within the self-attention layer.
                     
                     $$\mathrm{Attention}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{6}$$
                     
                     where Q, K, and V represent the queries, keys, and values, respectively, and d$_{\mathrm{k}}$
                        is the dimension of the keys. This mechanism
                        calculates attention scores by measuring the relevance of the queries to the keys.
                        The softmax function normalizes these scores, determining the importance of each value
                        (V) with respect to the given queries and keys. By focusing on the most relevant parts
                        of the input sequence, this mechanism captures the contextual dependencies, leading
                        to meaningful contextualized word representations. This mechanism is a critical component
                        of the Transformer architecture, contributing to the success of models, such as BERT,
                        in various natural language processing tasks.
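
                     The scaled dot-product attention can be illustrated with a short NumPy sketch for a single attention head. The matrix sizes and random inputs are arbitrary, and the subtraction of the row maximum is a standard numerical-stability step that does not change the softmax result.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights for one head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys of dimension d_k = 8
V = rng.normal(size=(5, 16))  # 5 values of dimension 16

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

                     Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors.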
                     
                     The "bert-base-uncased" variant of BERT, which comprises 12 layers, was used as the
                        foundation for classifying disaster-related tweets via transfer learning. This base
                        model was augmented with a fully connected output layer with a softmax
                        activation. The model was fine-tuned using the Adam optimizer with a learning rate
                        of 1 ${\times}$ 10$^{-5}$ and a decay of 1 ${\times}$ 10$^{-7}$.
                        Categorical cross-entropy was selected as the loss function because it
                        measures the discrepancy between the predicted and actual class probabilities, which makes
                        it suitable for multi-class classification.
                     
                   
                
               
                     3.8 Evaluation Metrics 
                  Evaluation metrics are essential for assessing the performance of machine learning
                     models. In this study, the performance of the GRU, LSTM, and BERT
                     models was assessed using three metrics: accuracy, precision, and recall.
                     These metrics indicate how often the model makes correct predictions, how many of its
                     positive predictions are correct, and how many of the actual positive instances
                     it identifies.
                  
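                  Accuracy, given by Eq. (7), measures the overall correctness of the model predictions.
                     It is the proportion of correct predictions among all predictions. Mathematically,
                     accuracy can be expressed as follows:

                  $\mathrm{Accuracy} = \frac{T_{\mathrm{P}} + T_{\mathrm{N}}}{T_{\mathrm{P}} + T_{\mathrm{N}} + F_{\mathrm{P}} + F_{\mathrm{N}}}$ (7)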
                  
                  
                  Precision, given by Eq. (8) and also known as the positive predictive value, quantifies
                     the proportion of positive class predictions that are correct. A high precision means
                     that few of the instances predicted as positive are false positives.

                  $\mathrm{Precision} = \frac{T_{\mathrm{P}}}{T_{\mathrm{P}} + F_{\mathrm{P}}}$ (8)
                  
                  
                  Recall, given by Eq. (9) and also known as sensitivity, hit rate, or true positive rate,
                     quantifies the proportion of actual positive class observations that were correctly
                     classified. It is a measure of completeness, that is, how many of the positive
                     instances the model can identify.

                  $\mathrm{Recall} = \frac{T_{\mathrm{P}}}{T_{\mathrm{P}} + F_{\mathrm{N}}}$ (9)
                  
                  
                  where T$_{\mathrm{P}}$, T$_{\mathrm{N}}$, F$_{\mathrm{P}}$, and F$_{\mathrm{N}}$ are
                     the true positives, true negatives, false positives, and false negatives, respectively.
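
The three metrics described above can be computed directly from the four counts. The sketch below assumes the standard definitions of accuracy, precision, and recall; the counts are invented for illustration and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative counts only (not taken from Table 1).
tp, tn, fp, fn = 40, 45, 5, 10

acc = accuracy(tp, tn, fp, fn)   # 85 correct out of 100
prec = precision(tp, fp)         # 40 of 45 positive predictions are correct
rec = recall(tp, fn)             # 40 of 50 actual positives are found
```

The guards against a zero denominator return 0.0 by convention when a class is never predicted or never occurs; other conventions are possible.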
                  
                  These evaluation metrics are crucial for understanding the strengths and weaknesses
                     of each model. By comparing them across the GRU, LSTM, and BERT models, this study
                     aimed to determine which model performs best on the disaster tweet classification
                     task.
                  
                
             
            
                  4. Results and Discussions
               This section discusses the results of training the disaster tweet classification
                  models. The training was conducted in Google Colab with GPU acceleration, using a
                  Tesla T4 GPU with a memory capacity of 15360 MiB (approximately 16 GB). The GPU
                  acceleration and memory capacity improved the efficiency of the training process.
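
The relation between the reported 15360 MiB and the rounded figure of 16 GB follows from the difference between binary and decimal units, as the short check below shows. The constant names are chosen here for illustration.

```python
MIB = 1024 ** 2   # bytes per mebibyte (binary unit)
GIB = 1024 ** 3   # bytes per gibibyte (binary unit)
GB = 1000 ** 3    # bytes per gigabyte (decimal unit)

memory_mib = 15360                 # value reported for the Tesla T4
memory_bytes = memory_mib * MIB

memory_gib = memory_bytes / GIB    # exactly 15 GiB
memory_gb = memory_bytes / GB      # roughly 16.1 GB
```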
               
               
                     4.1 Comparison of GRU, LSTM, and BERT Models
                  Table 1 compares the three models in terms of accuracy, precision, and recall on the
                     test data. BERT achieved the highest testing accuracy (0.962), followed by LSTM
                     (0.932) and GRU (0.8847). BERT also had the highest testing precision (0.963), which
                     indicates that most of the instances it assigned to a class actually belonged to that
                     class. BERT, a powerful language model, achieves impressive results in various natural
                     language processing tasks, including text classification, and in this disaster
                     classification task it accurately predicted the class of a given text. Fig. 3 presents
                     the confusion matrix for BERT on the disaster classification task, using the disaster
                     classes 'Accident', 'Earthquake', 'Flood', and 'Other disaster'. The confusion matrix
                     shows the number of correct and incorrect predictions for each class. The rows
                     represent the true classes, and the columns represent the predicted classes.
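
To make the row and column convention concrete, the sketch below builds a confusion matrix with true classes as rows and predicted classes as columns, then derives per-class precision and recall from it. The label lists are a tiny invented example, not the data behind Fig. 3, and the helper names are hypothetical.

```python
CLASSES = ["Accident", "Earthquake", "Flood", "Other disaster"]

def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true class, columns index the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_metrics(matrix, labels):
    """Precision and recall of each class, treating it as the positive class."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the row
        fp = sum(row[i] for row in matrix) - tp   # rest of the column
        metrics[label] = {
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }
    return metrics

# Invented labels for illustration only.
y_true = ["Accident", "Accident", "Earthquake", "Flood", "Flood", "Other disaster"]
y_pred = ["Accident", "Flood", "Earthquake", "Flood", "Other disaster", "Other disaster"]

matrix = confusion_matrix(y_true, y_pred, CLASSES)
metrics = per_class_metrics(matrix, CLASSES)
```

Off-diagonal entries in a row are the false negatives of that class, and off-diagonal entries in a column are its false positives, which is why the two sums above are taken along different axes.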
                  
                  From the confusion matrix, BERT achieved high accuracy in predicting the 'Accident',
                     'Earthquake', and 'Other disaster' classes, with most predictions falling
                     into these categories being correct.
                  
                  LSTM also performed well in terms of precision, with a value of 0.952, whereas GRU
                     had a lower precision of 0.8923. Regarding testing recall, BERT achieved the highest
                     score of 0.9625, followed by LSTM with 0.917 and GRU with 0.8811. Overall, BERT
                     demonstrated the best performance across all three metrics. LSTM also performed
                     well, and GRU showed lower accuracy, precision, and recall.
                  
                  
                        Table 1. Comparison of three models based on accuracy, recall, and precision.
 
                  
                        Fig. 3. Confusion matrix for BERT.
 
                
               
                     4.2 Performance Analysis of the BERT Model
                  A few instances of the 'Accident' and 'Earthquake' classes were misclassified as
                     'Flood' or 'Other disaster'. Similarly, the 'Flood' class had some misclassifications,
                     with a few instances being predicted as 'Accident' or 'Other disaster'. The 'Other
                     disaster' class also had a few misclassifications, with some instances being predicted
                     as 'Accident', 'Earthquake', or 'Flood'.
                  
                  Overall, BERT demonstrated its effectiveness in disaster classification, correctly
                     predicting the majority of instances. Nevertheless, there is still room for
                     improvement, particularly in reducing misclassifications between certain classes.
                  
                  The history plot of the training and validation accuracy of the BERT model over 20
                     epochs reveals interesting trends, as shown in Fig. 4. The validation accuracy started
                     at 94 % and rose to 96 % at the 3$^{\mathrm{rd}}$ epoch, after which it remained
                     constant for the remaining epochs. The training accuracy started at 88 %, increased
                     steadily to 98 % by the 3$^{\mathrm{rd}}$ epoch, and then continued to increase
                     slightly, reaching 99.2 %. These curves suggest that the BERT model performs well in
                     terms of both training and validation accuracy, with the validation accuracy
                     stabilizing after an initial improvement. The continued increase in training accuracy
                     indicates that the model kept fitting the training data more closely, although this
                     did not translate into further gains in validation accuracy.
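
One simple way to read such a history is to track the gap between training and validation accuracy. The sketch below uses only the accuracy values quoted in the text; the intermediate epochs are not reported, so they are left out rather than guessed, and the dictionary keys are labels chosen here for illustration.

```python
# Accuracy values in percent, as quoted in the text. "final" pairs the last
# reported training accuracy with the validation plateau.
reported = {
    "start": {"train": 88.0, "val": 94.0},
    "epoch 3": {"train": 98.0, "val": 96.0},
    "final": {"train": 99.2, "val": 96.0},
}

# Positive gap: the model scores higher on training data than on validation data.
gaps = {
    stage: round(acc["train"] - acc["val"], 1)
    for stage, acc in reported.items()
}
```

A gap that keeps widening while validation accuracy stays flat is a common signal to watch for overfitting, for example by stopping training once the validation curve has plateaued.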
                  
                  An eight-epoch comparison of three model variants was performed using BERT for disaster
                     tweet classification. The objective was to analyze the influence of the number of
                     hidden layers on the model performance. The original model, featuring one hidden layer,
                     exhibited impressive progress during training, consistently enhancing accuracy, precision,
                     recall, and F1-score on the validation dataset. This highlighted the proficiency of
                     the model in capturing the essential patterns from the text data.
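
The F1-score mentioned here is not defined in Section 3.8. Assuming the standard definition as the harmonic mean of precision and recall, it can be sketched as follows; the input values are illustrative and are not results from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: the harmonic mean is pulled toward the smaller input.
f1 = f1_score(0.9, 0.6)
```

Because the harmonic mean penalizes imbalance, a model cannot reach a high F1-score by maximizing only precision or only recall.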
                  
                  In Variant 1, designed with two hidden layers, the model performed competitively.
                     Despite initially lagging behind the original model on the metrics, it converged
                     rapidly and reached commendable evaluation scores. This suggests that the additional
                     hidden layer facilitated nuanced pattern recognition, yielding results equivalent to
                     or better than those of the original model.
                  
                  Variant 2, leveraging four hidden layers, demonstrated swift pattern discernment and
                     efficient convergence. Despite its increased complexity, this architecture achieved
                     notable precision, recall, and F1-score values, underscoring its capability to learn
                     intricate text features.
                  
                  These observations underscore the interplay between the number of hidden layers and
                     model performance. Both the simpler and the deeper architectures yielded promising
                     results, with the deeper variants potentially benefiting from enhanced feature
                     extraction capabilities. On the other hand, deeper models carry a higher risk of
                     overfitting, which must be considered when adjusting the architecture. In summary,
                     this analysis, conducted using BERT for disaster tweet classification, sheds light
                     on the impact of hidden layer variations and offers insights for architectural
                     decisions in natural language processing tasks.
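
To give a sense of how the number of hidden layers changes the size of the classification head, the sketch below counts the trainable parameters of a fully connected head with one, two, and four hidden layers. The input size of 768 is the hidden size of BERT-base and the four outputs match the four classes; the hidden width of 128 is a hypothetical value, since the layer widths are not stated in the text.

```python
def head_parameter_count(input_dim, hidden_dim, num_hidden_layers, num_classes):
    """Count weights and biases of a fully connected classification head."""
    sizes = [input_dim] + [hidden_dim] * num_hidden_layers + [num_classes]
    # Each dense layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

BERT_BASE_HIDDEN_SIZE = 768   # hidden size of BERT-base
HIDDEN_DIM = 128              # hypothetical width, not reported in the study
NUM_CLASSES = 4               # Accident, Earthquake, Flood, Other disaster

counts = {
    n: head_parameter_count(BERT_BASE_HIDDEN_SIZE, HIDDEN_DIM, n, NUM_CLASSES)
    for n in (1, 2, 4)        # original model, Variant 1, Variant 2
}
```

Under these assumptions, each extra hidden layer adds the same fixed number of parameters, which is small relative to the pretrained encoder but still increases the capacity that must be regularized during fine-tuning.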
                  
                  
                        Fig. 4. Plot illustrating the validation and training accuracy of the BERT model over 20 epochs.
 
                
             
            
                  Conclusion and Future Work
               This study analyzed the efficacy of BERT, GRU, and LSTM deep learning models in classifying
                  disaster-related tweets. The results showcased the superior performance of BERT in
                  precision, recall, and accuracy. This highlights the potential of BERT for improved disaster
                  management by analyzing tweets, identifying the disaster type, and formulating appropriate
                  response strategies. The study also highlighted the importance of location information
                  in disaster management and the varied word usage based on the type of disaster.
               
               These findings are promising. Future research should extend the approach to
                  other disaster types, such as wildfires or pandemics, to explore the adaptability
                  of these models. How these models can be integrated with current disaster
                  management systems for improved efficiency is also a subject for future research.
                  Furthermore, as these models advance, ethical considerations of content filtration
                  and information prioritization should be evaluated to ensure responsible and transparent
                  utilization that does not infringe on ethical norms or human rights.
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                  				This work was supported in part by the National Research Foundation of Korea (NRF)
                  grant funded by the Korean government (MSIT) (No.2022R1A2C1003549) and in part by
                  the 2023 Hongik University Innovation Support Program Fund.
                  			
               
             
            
                  
                     REFERENCES
                  
                     
                        
                        Zou, Lei, Danqing Liao, Nina SN Lam, Michelle A. Meyer, Nasir G. Gharaibeh, Heng Cai,
                           Bing Zhou, and Dongying Li. "Social media for emergency rescue: An analysis of rescue
                           requests on Twitter during Hurricane Harvey." International Journal of Disaster Risk
                           Reduction 85 (2023): 103513.

 
                     
                        
                        Karimiziarani, Mohammadsepehr, Keighobad Jafarzadegan, Peyman Abbaszadeh, Wanyun Shao,
                           and Hamid Moradkhani. "Hazard risk awareness and disaster management: Extracting the
                           information content of twitter data." Sustainable Cities and Society 77 (2022): 103577.

 
                     
                        
                        Samuel, Jim, G. G. Md. Nawaz Ali, Md. Mokhlesur Rahman, Ek Esawi, and Yana Samuel.
                           2020. "COVID-19 Public Sentiment Insights and Machine Learning for Tweets Classification"
                           Information 11, no. 6: 314.

 
                     
                        
                        Piyush Jain, Sean C.P. Coogan, Sriram Ganapathi Subramanian, Mark Crowley, Steve Taylor,
                           and Mike D. Flannigan. 2020. A review of machine learning applications in wildfire
                           science and management. Environmental Reviews. 28(4): 478-505.

 
                     
                        
                        Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M. (2016). The paradigm-shift
                           of social spambots: Evidence, theories, and tools for the arms race. IEEE Communications
                           Magazine, 54(3), 100-107.

 
                     
                        
                        Imran, M., Elbassuoni, S. M., Castillo, C., Diaz, F., Meier, P. (2016). Practical
                           extraction of disaster-relevant information from social media. Proceedings of the
                           39th International ACM SIGIR Conference on Research and Development in Information
                           Retrieval, 1023-1026.

 
                     
                        
                        R. Ni and H. Cao, "Sentiment Analysis based on GloVe and LSTM-GRU," 2020 39th Chinese
                           Control Conference (CCC), Shenyang, China, 2020, pp. 7492-7497.

 
                     
                        
                        Ashktorab, Zahra, Christopher Brown, Manojit Nandi, and Aron Culotta. "Tweedr: Mining
                           twitter to inform disaster response." In ISCRAM, pp. 269-272. 2014.

 
                     
                        
                        Nguyen, Dong, Rilana Gravel, Dolf Trieschnigg, and Theo Meder. "'How old do you think
                           I am?' A study of language and age in Twitter." In Proceedings of the International
                           AAAI Conference on Web and Social Media, vol. 7, no. 1, pp. 439-448. 2013.

 
                     
                        
                        Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training
                           of deep bidirectional transformers for language understanding" (2018).

 
                     
                        
                        Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
                           Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
                           machine translation.

 
                     
                        
                        Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training
                           of deep bidirectional transformers for language understanding, 2018.

 
                   
                
             
            Author
            
            
               			Ihsan Ullah  received his B.S. in Computer Systems Engineering from the University
               of Engineering and Technology Peshawar, Pakistan, and his M.S. in Computer and Wireless
               Networks from COMSATS University, Islamabad, in 2021. He was a research assistant
               in the Wireless and Communication lab for half a year. He is pursuing his Ph.D. in
               Software and Communication Engineering at Hongik University, South Korea, under Prof.
               Byung-Seo Kim. His research interests encompass NDN, Underwater Wireless Sensor Networks,
               Cloud and Fog Computing, Vehicular Networks, and aspects of Machine Learning and Artificial
               Intelligence.
               		
            
            
            
               			Anum Jamil  is a final-year B.S. student in Applied Physics at NED University of
               Engineering and Technology, Karachi. She is currently interning at the university's
               Smart City Lab. She is also engaged with the distinguished President's Initiative
               of Artificial Intelligence, demonstrating her dedication to machine learning and AI.
               Her research interests lie in Natural Language Processing (NLP), the Internet of Things
               (IoT), and Artificial Intelligence.
               		
            
            
            
               			Imtiaz ul Hassan  holds a B.S. degree in Computer Systems Engineering from the
               University of Engineering and Technology Peshawar, Pakistan. Currently, he is pursuing
               his M.S. in Data Science from NEDUET Karachi. In addition to his studies, Imtiaz is
               actively engaged as a research associate in the Smart City LAB at the National Center
               for Artificial Intelligence. His research interests primarily involve computer vision,
               natural language processing (NLP), autonomous vehicles, and robotics.
               		
            
            
            
               			Byung-Seo Kim  received his B.S. degree in Electrical Engineering from In-Ha University,
               In-Chon, Korea in 1998 and his M.S. and Ph.D. degrees in Electrical and Computer Engineering
               from the University of Florida in 2001 and 2004, respectively. His Ph.D. study was
               supervised by Dr. Yuguang Fang. Between 1997 and 1999, he worked for Motorola Korea
               Ltd., PaJu, Korea, as a CIM Engineer in ATR&D. From January 2005 to August 2007, he
               worked for Motorola Inc., Schaumburg, Illinois, as a Senior Software Engineer in Networks
               and Enterprises for designing the protocol and network architecture of wireless broadband
               mission-critical communications. He is a professor in the Department of Software and
               Communications Engineering at Hongik University, Korea. He is an IEEE Senior Member
               and an Associate Editor of IEEE Access, Telecommunication Systems, and the Journal
               of the Institute of Electronics and Information Engineers. His studies have appeared
               in approximately 260 publications and 32 patents. His research interests include designing
               and developing efficient wireless/wired networks, including link-adaptable/cross-layer-based
               protocols, multi-protocol structures, wireless CCNs/NDNs, Mobile Edge Computing, physical
               layer design for broadband PLC, and resource allocation algorithms for wireless networks.