Open Access Article

Image Caption Generation: A Survey

Khushboo Khurana1, Shyamal Mundada2

  1. Computer Science and Engineering Department, Shri Ramdeobaba College of Engineering and Management, Nagpur, India.
  2. Computer Science and Engineering Department, Shri Ramdeobaba College of Engineering and Management, Nagpur, India.

Section: Survey Paper, Product Type: Journal Paper
Volume-6, Issue-3, Page no. 256-262, Mar-2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6i3.256262

Online published on Mar 30, 2018

Copyright © Khushboo Khurana, Shyamal Mundada. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Khushboo Khurana, Shyamal Mundada, “Image Caption Generation: A Survey,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.3, pp.256-262, 2018.

MLA Style Citation: Khurana, Khushboo, and Shyamal Mundada. "Image Caption Generation: A Survey." International Journal of Computer Sciences and Engineering 6.3 (2018): 256-262.

APA Style Citation: Khurana, K., & Mundada, S. (2018). Image Caption Generation: A Survey. International Journal of Computer Sciences and Engineering, 6(3), 256-262.

BibTex Style Citation:
@article{Khurana_2018,
author = {Khushboo Khurana and Shyamal Mundada},
title = {Image Caption Generation: A Survey},
journal = {International Journal of Computer Sciences and Engineering},
volume = {6},
number = {3},
month = {March},
year = {2018},
issn = {2347-2693},
pages = {256-262},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=1793},
doi = {10.26438/ijcse/v6i3.256262},
publisher = {IJCSE, Indore, INDIA}
}

RIS Style Citation:
TY - JOUR
TI - Image Caption Generation: A Survey
T2 - International Journal of Computer Sciences and Engineering
AU - Khushboo Khurana
AU - Shyamal Mundada
PY - 2018
DA - 2018/03/30
PB - IJCSE, Indore, INDIA
SP - 256
EP - 262
IS - 3
VL - 6
SN - 2347-2693
DO - 10.26438/ijcse/v6i3.256262
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=1793
ER -

Abstract

In recent years, the number of images has grown rapidly due to advances in technology. This proliferation of image data demands automatic analysis and generation of image descriptions. The generated captions must describe the image precisely, covering all of its salient aspects. Automatic image caption generation has emerged as an important research task in the integrated language-vision community. Image captioning techniques can be broadly divided into data-driven and feature-driven methods. Data-driven techniques either retrieve a single similar image and copy its caption verbatim, or retrieve multiple similar images and combine their captions to form an appropriate caption for the input image. Feature-driven methods analyze the visual content of the image and then generate natural language sentences. In this paper, we review both classes of methods, with emphasis on the most effective feature-driven approach, which uses Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Researchers have proposed various CNN-RNN based techniques for the image captioning task and have achieved remarkable results. In these approaches, a CNN first analyzes the image objects and their relations, and an RNN then generates the sentence. We also elaborate on the concept of CNNs.
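To make the CNN-RNN pipeline concrete, the sketch below shows a minimal encoder-decoder captioner in PyTorch: a CNN encodes the image into a feature vector, which conditions an LSTM (a type of RNN) that generates the caption word by word. The choice of ResNet-18, the embedding and hidden sizes, the vocabulary size, and the toy forward pass are illustrative assumptions, not the architecture of any specific surveyed paper.

import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a CNN whose classification head is replaced so that it
        # outputs an image embedding instead of class scores.
        cnn = models.resnet18()          # pretrained weights omitted here
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        # Decoder: an LSTM that emits the caption one word at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is fed as the first step of the sequence, a
        # common way to condition the RNN on the visual content.
        img_feat = self.encoder(images).unsqueeze(1)       # (B, 1, E)
        word_embs = self.embed(captions[:, :-1])           # (B, T-1, E)
        inputs = torch.cat([img_feat, word_embs], dim=1)   # (B, T, E)
        hidden, _ = self.lstm(inputs)                      # (B, T, H)
        return self.out(hidden)                            # per-step word scores

# Toy forward pass with random data, only to illustrate the tensor shapes.
model = CaptionModel(vocab_size=1000)
images = torch.randn(4, 3, 224, 224)                       # 4 RGB images
captions = torch.randint(0, 1000, (4, 12))                 # 4 captions, 12 tokens
scores = model(images, captions)                           # (4, 12, 1000)
loss = nn.CrossEntropyLoss()(scores.reshape(-1, 1000), captions.reshape(-1))

In such a setup, training would minimize this cross-entropy loss against reference captions, while inference would run the LSTM step by step, feeding back its own predicted words until an end-of-sentence token is produced.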

Key-Words / Index Term

Image captioning, image understanding, CNN, RNN, natural language generation
