Deepfake Video Detection in the Compressed Domain
Ever since the dot-com boom, video has been one of the most important means of mass communication; a recent Cisco study projected that video would account for 82% of internet traffic by the end of 2022. Unlike images, however, video analysis is computationally costly, as it demands extracting visual information from a vast number of pixel values. Given this significant impact on our day-to-day lives, it is clear that video understanding will shape the future if we can make it practical enough for real-time use. Although 2D Convolutional Neural Networks (2D CNNs) have demonstrated state-of-the-art accuracy and efficiency on various image understanding tasks, they have not achieved comparable success in video analytics. One likely reason is that video signals are complex and contain high temporal redundancy, making it challenging to extract rich spatio-temporal features that capture the key characteristics of a video. At the same time, precisely to remove spatial and temporal redundancy from the bit-stream, videos are usually streamed and stored in compressed form rather than in raw format. It is therefore natural to let 2D CNNs learn visual information directly from the compressed representation of a video instead of first reconstructing the RGB frames. We propose a Deepfake Video Detection Network (DVDNet) that extracts the key information confined in a small number of frames, the intra-coded frames (I-frames), of the compressed representation. Compared with other compressed-domain techniques that leverage I-frames, motion vectors, and residuals together, our approach requires fewer modalities to understand the data. By working solely on I-frames, it significantly reduces pre-processing time and computational demand, making it accessible to a broader range of machines and use cases. We focus on one of the most critical video understanding tasks in computer vision: deepfake video detection. Manipulated videos synthesized with deep learning (deepfakes) are an emerging threat, making their detection vital, and we therefore evaluate our proposed network architecture extensively on this task. DVDNet achieves 95.88% accuracy with an AUC of 0.978 on the Celeb-DF dataset, and 87.50% accuracy with an AUC of 0.97539 on the FaceForensics++ dataset. Our key contribution is a significant reduction in computational complexity while maintaining this level of detection accuracy.
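The abstract does not specify the extraction tooling, so the following is a minimal sketch, assuming the PyAV bindings to FFmpeg, of how I-frames can be pulled from a compressed video while the decoder skips P- and B-frames entirely; the function name and file path are illustrative and not part of DVDNet.

```python
# Minimal I-frame extraction sketch using PyAV (pip install av).
# Assumption: PyAV stands in for whatever demuxer the paper used.
import av


def extract_iframes(video_path):
    """Yield decoded I-frames as RGB numpy arrays from a compressed video."""
    container = av.open(video_path)
    stream = container.streams.video[0]
    # Tell the decoder to skip every packet that is not a keyframe,
    # so only intra-coded frames (I-frames) are ever decoded.
    stream.codec_context.skip_frame = "NONKEY"
    for frame in container.decode(stream):
        yield frame.to_ndarray(format="rgb24")


if __name__ == "__main__":
    # Hypothetical usage: print the shape of each extracted I-frame.
    for i, rgb in enumerate(extract_iframes("sample.mp4")):
        print(f"I-frame {i}: shape={rgb.shape}")
```

Because the decoder never reconstructs inter-coded frames, this step avoids most of the per-frame decoding cost, which is the source of the pre-processing savings the abstract claims; the extracted I-frames could then be fed to any 2D CNN classifier.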