P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2022, Volume: 15, Issue: 43, Pages: 2325-2335

Original Article

Novel Transfer Learning Attitude for Automatic Video Captioning Using Deep Learning Models

Received Date: 13 September 2022, Accepted Date: 08 October 2022, Published Date: 17 November 2022


Objectives: To generate captions for videos with low time complexity and high accuracy, and to produce a caption for each input video frame together with its timestamp. Intended applications include crime investigation and helping hearing-impaired viewers follow what happens in a video.

Methods: The proposed approach uses transfer learning. Modified Inception v3 and ResNet-50 networks are designed so that their results can be compared. The architectures are demonstrated on the standard MSVD dataset, and their performance is compared using standard performance metrics.

Findings: The Inception v3 model performs better than the ResNet-50 architecture on the video captioning task. It achieves the best accuracy, 99.83%, when captioning the given input videos, outperforming the ResNet-50 model. The MSVD dataset is found to be well suited to demonstrating the video captioning task.

Novelty: Both proposed models are modified to fit the way video captioning works. Aggregating some layers improves the performance of the models over their unmodified counterparts.
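The objective of attaching a timestamp to each frame-level caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `TimedCaption`, `format_timestamp`, `caption_frames`, and the `caption_fn` callback are hypothetical, and the captioning model (for example, a fine-tuned Inception v3 or ResNet-50 encoder with a language decoder) is assumed to be supplied by the caller as a function from a frame index to a caption string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TimedCaption:
    """A caption tied to one frame (hypothetical container, not from the paper)."""
    frame_index: int
    timestamp: str  # formatted as HH:MM:SS.mmm
    caption: str


def format_timestamp(seconds: float) -> str:
    """Convert a time offset in seconds to an HH:MM:SS.mmm string."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def caption_frames(
    num_frames: int,
    fps: float,
    caption_fn: Callable[[int], str],
    every_n: int = 1,
) -> List[TimedCaption]:
    """Caption every `every_n`-th frame and attach its timestamp.

    `caption_fn` stands in for the trained captioning model (an assumption):
    it receives a frame index and returns that frame's caption.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return [
        TimedCaption(i, format_timestamp(i / fps), caption_fn(i))
        for i in range(0, num_frames, every_n)
    ]


if __name__ == "__main__":
    # Dummy captioner used only to show the output format.
    for item in caption_frames(90, 30.0, lambda i: f"caption for frame {i}", every_n=30):
        print(item.timestamp, item.caption)
```

The timestamp is derived from the frame index and the frame rate alone, so the same helper applies whichever encoder produces the captions.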

Keywords: Artificial Intelligence; Automatic Captioning; Transfer Learning; Frames; Inception V3; Residual Network50 Model




© 2022 Vaishnavi & Narmatha. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

