Video Description Using Learning Multiple Features
- DOI
- 10.2991/itim-17.2017.34
- Keywords
- Video description, SIFT flow, VGG-16, mean pooling, LSTM
- Abstract
Generating descriptions for open-domain videos is a major challenge for computer vision due to their complex dynamics. In this paper, we propose a video description model based on multiple features. In the encoding process, we exploit two complementary features: a spatial feature extracted from the raw frame by a VGG-16 model, and a temporal feature extracted from the SIFT flow image by a fine-tuned VGG-16 model. In the decoding process, we further add a mean pooling feature that represents the holistic feature of the video. To generate a sentence describing the video, we utilize a two-layer LSTM model. We evaluate several variants of our model on the MSVD dataset using the METEOR metric. The experimental results show that our model is beneficial for generating descriptions of videos.
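The feature construction described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the frame count is an arbitrary assumption, the random arrays stand in for real VGG-16 outputs, and 4096 is assumed for both streams (the size of a VGG-16 fully connected layer):

```python
import numpy as np

# Assumed dimensions: VGG-16 fc features are 4096-d; we assume the same
# size for the SIFT-flow stream, since both streams use a VGG-16 backbone.
NUM_FRAMES, FEAT_DIM = 30, 4096

rng = np.random.default_rng(0)
# Stand-ins for per-frame CNN outputs (real features would come from VGG-16).
spatial = rng.standard_normal((NUM_FRAMES, FEAT_DIM))   # raw-frame stream
temporal = rng.standard_normal((NUM_FRAMES, FEAT_DIM))  # SIFT-flow stream

# Encoding: combine the two complementary streams per frame.
per_frame = np.concatenate([spatial, temporal], axis=1)  # (30, 8192)

# Decoding input: mean pooling over the frame axis gives a holistic
# feature of the whole video, as described in the abstract.
holistic = per_frame.mean(axis=0)                        # (8192,)

print(per_frame.shape, holistic.shape)
```

The per-frame features would then be fed step by step into the two-layer LSTM decoder, with the mean-pooled holistic feature providing video-level context.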
- Copyright
- © 2017, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - CONF
AU  - Xin Xu
AU  - Chunping Liu
AU  - Haibin Liu
AU  - Yi Ji
AU  - Zhaohui Wang
PY  - 2017/08
DA  - 2017/08
TI  - Video Description Using Learning Multiple Features
BT  - Proceedings of the 2017 International Conference on Information Technology and Intelligent Manufacturing (ITIM 2017)
PB  - Atlantis Press
SP  - 137
EP  - 140
SN  - 1951-6851
UR  - https://doi.org/10.2991/itim-17.2017.34
DO  - 10.2991/itim-17.2017.34
ID  - Xu2017/08
ER  -