By: Sejal Jain, Saloni Agarwal, Sonam Gour, Sanjivani Sharma, and Shrishti Agarwal
Image caption generation is a software technology that takes an image as input and produces a descriptive caption in text form. This technology finds application in many fields. For instance, automatically generating captions for medical images aids diagnosis and improves reporting efficiency, helping healthcare professionals quickly interpret complex visuals. In autonomous vehicles, image captioning enables the vehicle to understand and communicate about its surroundings, improving safety and navigation. In journalism, generated captions for news images can enhance reader comprehension and engagement. This paper provides an overview of the technologies that can be used to develop an image caption generator with the Flickr8K dataset from Kaggle. The implementation uses tools such as OpenCV, which is widely adopted by leading technology companies such as Google and Microsoft. The paper also includes snapshots of the generated outputs to illustrate the model's effectiveness. The primary aim of this implementation is to gain insight into the practical use of these tools and technologies in real-world projects.
Keywords: image captioning, Python, neural networks, ResNet, OpenCV, long short-term memory (LSTM), Keras, deep learning