Authors:
(1) Dinesh Kumar Vishwakarma, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India;
(2) Mayank Jindal, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India;
(3) Ayush Mittal, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India;
(4) Aditya Sharma, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India.
Table of Links
- Abstract and Intro
- Background and Related Work
- EMTD Dataset
- Proposed Methodology
- Experiments
- Conclusion and References
2. Background and Related Work
This section discusses past methodologies for movie genre classification and the motivation behind our study. Video content is broadly partitioned into (1) video frames (images) and (2) audio (speech {dialogues} + non-speech {vocals}). Prior studies analyzing video content have focused chiefly on either the cognitive [3]–[7] or the affective [8] level in isolation. For better performance on a genre classification task, both levels need to be taken into account.
Many past cognition-based approaches rely on low-level features, including visual disturbances, average shot length, gradual changes in light intensity across video frames, and peaks in the audio waveform [3], to capture scene components [4]. Other features used for cognitive classification include RGB colors in frames [6], film shots [7], shot length [9], and the type of background in scenes (dark/non-dark) [6]. Similarly, a few approaches target affective analysis alone [8].
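As a concrete illustration, the sketch below computes two such low-level features, the mean frame color and an average shot length estimated by frame differencing, using OpenCV. This is a minimal sketch under our own assumptions (the `diff_threshold` cut-detection heuristic and the feature choices are illustrative), not a reproduction of any cited work's pipeline.

```python
import cv2
import numpy as np

def low_level_features(video_path, diff_threshold=30.0):
    """Illustrative low-level features: mean color and average shot length.

    The frame-difference threshold for cut detection is a hypothetical
    choice, not taken from the cited works.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    means, cuts, n_frames, prev_gray = [], 0, 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        n_frames += 1
        means.append(frame.reshape(-1, 3).mean(axis=0))  # per-frame mean BGR
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # A large mean absolute frame difference is treated as a hard cut.
        if prev_gray is not None and np.abs(gray - prev_gray).mean() > diff_threshold:
            cuts += 1
        prev_gray = gray
    cap.release()
    mean_bgr = np.mean(means, axis=0)             # average color over the clip
    avg_shot_len = (n_frames / fps) / (cuts + 1)  # seconds per shot
    return mean_bgr, avg_shot_len
```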
A movie can have multiple genres, conveying rich information to viewers, so genre prediction can also serve as a basis for movie recommendation. Jain et al. [5] used four video features (shot length, motion, color dominance, lighting key) and five audio features to classify complete movie clips into genres. However, the study trained on only 200 samples, so the reported accuracy may be due to over-fitting; it also addressed only single-label classification. Huang et al. [4] proposed the Self-Adaptive Harmony Search algorithm with seven stacked SVMs that used both audio and visual features (about 277 features in total) on a dataset of 223 samples. Ertugrul et al. [10] used low-level features, including the movie's plot, by breaking the plot into sentences, classifying each sentence into a genre, and taking the final genre to be the one with the maximum occurrence. Pais et al. [11] proposed to fuse image-text features by relying on important words from the overall synopsis and performed movie genre classification based on those features; the model was tested on a set of 107 movie trailers. Shahin et al. [12] used movie plots and quotes and proposed hierarchical attention networks to classify genres. Similarly, Kumar et al. [13] proposed to classify genre from movie plots using hash vectorization, focusing on reducing overall time complexity. These studies rely on low-level features and do not capture any high-level features from movie trailers, so they cannot support a strong recognition system.
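To make the sentence-level scheme concrete, the following sketch combines hash vectorization, as in [13], with the majority-vote idea of [10]: each plot sentence is classified independently and the most frequent predicted genre wins. The toy data, classifier choice, and `plot_genre` helper are hypothetical illustrations, not the authors' implementations.

```python
from collections import Counter

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: plot sentences with genre labels.
sentences = ["a detective hunts a serial killer",
             "the couple falls in love in paris",
             "an alien fleet attacks the earth"]
labels = ["thriller", "romance", "sci-fi"]

# Hash vectorization maps sentences into a fixed-size sparse space
# without storing a vocabulary, keeping time and memory costs low.
clf = make_pipeline(HashingVectorizer(n_features=2**16),
                    LogisticRegression(max_iter=1000))
clf.fit(sentences, labels)

def plot_genre(plot_sentences):
    """Classify each sentence, then return the most frequent genre."""
    votes = clf.predict(plot_sentences)
    return Counter(votes).most_common(1)[0][0]

print(plot_genre(["a killer stalks the city", "police chase the suspect"]))
```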
In more recent studies, many researchers have used deep networks for movie genre classification. Shambharkar et al. [14] proposed a single-label 3D CNN-based architecture to capture spatial and temporal features. Although it captures both, the single-label formulation limits its robustness. Some researchers have worked on movie posters to classify genres. Chu et al. [15] formulated a deep neural network that combines object detection with visual appearance cues. Although posters carry substantial information, a poster alone cannot fully describe a movie. Simoes et al. [16] proposed CNN-Motion, which combines scene histograms produced by an unsupervised clustering algorithm, weighted genre predictions for each trailer, and some low-level video features. This covers a broad group of video features but still lacks some of the affective and cognitive features needed to classify genre.
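For intuition about how a 3D CNN jointly captures spatial and temporal structure, here is a minimal PyTorch sketch; the layer sizes, clip shape, and genre count are illustrative assumptions, not Shambharkar et al.'s architecture.

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Minimal 3D CNN: convolves over (time, height, width) jointly,
    so motion (temporal) and appearance (spatial) are learned together."""
    def __init__(self, num_genres=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # input: RGB clip
            nn.ReLU(),
            nn.MaxPool3d(2),                             # halve T, H, W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global pooling
        )
        self.classifier = nn.Linear(32, num_genres)

    def forward(self, clips):                            # (B, 3, T, H, W)
        x = self.features(clips).flatten(1)
        return self.classifier(x)                        # genre logits

# Example: a batch of two 16-frame 112x112 RGB clips.
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 8])
```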
Thus, the past literature makes it evident that rich information must be extracted from video trailers for both cognitive and affective study. Our motivation is therefore to devise an approach that relies on both levels of video content analysis, as in [1]. We believe the proposed architecture and model are novel and robust and can serve various research perspectives in the future.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.