Abstract
Deepfake manipulation of multimedia content, especially videos and photos, threatens social cohesion (e.g., rumour propagation,
extortion, and truth distortion) and must not be ignored. This threat calls for effective detection
solutions. Most studies suggest that convolutional neural networks (CNNs) alone may not be able to extract complex features
like those introduced during deepfake production. Thus, hybrid approaches that can capture complex features and act as powerful
descriptors for binary classification are needed to separate fake from genuine content. In this paper, a hybrid algorithm is
developed that combines a CNN with gated recurrent units (GRUs). The proposed model aims to improve the extraction of
complex features by simultaneously capturing temporal and spatial features. This approach permits the extraction
of implicit features that are vital to the final classification process, especially when dealing with frame sequences within
video content. Finally, a dense neural network is used to classify these features. In practice, two datasets were used
to train the proposed model: the FaceForensics++ (FF++) and DeepFake Detection Challenge (DFDC) datasets. On the
FF++ dataset, the proposed model reached an Area Under the Curve (AUC) of 0.88 and an F1-score of 0.85.
On the DFDC dataset, it reached an AUC of 0.95 and an F1-score of 0.86.
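
For readers unfamiliar with the two reported metrics, the following minimal sketch shows how AUC and F1-score can be computed for a binary real/fake classifier. It is an illustration only and not the evaluation code used in this work: the function names `roc_auc` and `f1_score`, the 0.5 decision threshold, and the toy labels and scores are all assumptions introduced here. Both metrics take values between 0 and 1.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random fake (label 1) outscores
    a random real (label 0) sample; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = fake, 0 = real; scores are model outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Assumed decision threshold of 0.5 to turn scores into hard predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

print("AUC:", round(roc_auc(labels, scores), 3))
print("F1: ", round(f1_score(labels, preds), 3))
```

AUC is computed from the raw scores and is therefore independent of any threshold, whereas the F1-score depends on the threshold chosen to produce hard predictions.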