Vol. 20 No. 1 (2024)

Published: June 30, 2024

Pages: 195-205

Original Article

Deep Learning Video Prediction Based on Enhanced Skip Connection

Abstract

Video prediction methods have progressed rapidly, especially since the revolution brought by deep learning. Prediction architectures based on pixel generation produce blurry forecasts, yet they are preferred in many applications because they operate on frames alone and do not require supporting information such as segmentation or optical-flow maps, for which suitable datasets are very difficult to obtain. In this work, we present a novel end-to-end video forecasting framework that predicts the dynamic relationship between pixels in time and space. A 3D CNN encoder estimates the dynamic motion, while the decoder reconstructs the next frame with the help of 3D CNN and ConvLSTM2D layers added in the skip connections. This novel skip-connection design plays an important role in reducing blur in the predictions and preserving spatial and dynamic information, which increases the accuracy of the whole model. KITTI and Cityscapes are used for training, and Caltech is used for inference. The proposed framework achieves better quality (PSNR = 33.14, MSE = 0.00101, SSIM = 0.924) with a small number of parameters (2.3 M).
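
To make the architecture described above concrete, the following is a minimal Keras sketch of an encoder-decoder of this kind: a 3D CNN encoder over a short input clip, and a decoder whose skip connection is refined by Conv3D and ConvLSTM2D layers before being fused with the upsampled features. The clip length, frame size, channel widths, and kernel shapes are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of a 3D CNN encoder with a ConvLSTM2D-enhanced skip connection.
# Shapes and layer choices are assumptions for illustration, not the authors' exact model.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W, C = 4, 128, 160, 3                      # assumed clip length and frame size

inp = layers.Input(shape=(T, H, W, C))

# --- 3D CNN encoder: extracts spatio-temporal motion features ---
e1 = layers.Conv3D(32, (3, 3, 3), strides=(1, 2, 2), padding="same", activation="relu")(inp)
e2 = layers.Conv3D(64, (3, 3, 3), strides=(1, 2, 2), padding="same", activation="relu")(e1)

# --- enhanced skip connection: Conv3D + ConvLSTM2D refine the shallow features ---
skip = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(e1)
skip = layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=False)(skip)

# --- decoder: aggregate time, upsample, and fuse the refined skip features ---
d = layers.ConvLSTM2D(64, (3, 3), padding="same", return_sequences=False)(e2)
d = layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same", activation="relu")(d)
d = layers.Concatenate()([d, skip])
d = layers.Conv2DTranspose(16, (3, 3), strides=2, padding="same", activation="relu")(d)
out = layers.Conv2D(C, (3, 3), padding="same", activation="sigmoid")(d)   # predicted next frame

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")
```

In this sketch the skip path carries shallow, high-resolution features through a ConvLSTM2D cell, so the decoder receives temporally aggregated detail rather than a plain copy of the encoder activations; this is the intuition behind the enhanced skip connection, although the exact fusion used by the authors may differ.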
