Multilevel Profiling of Situation and Dialogue-based Deep Networks: Abstract and Intro

28 May 2024

Authors:

(1) Dinesh Kumar Vishwakarma, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India;

(2) Mayank Jindal, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India;

(3) Ayush Mittal, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India;

(4) Aditya Sharma, Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India.

Abstract

Automated movie genre classification has emerged as an active and essential area of research. Short-duration movie trailers provide useful insights about a movie, as video content consists of both cognitive-level and affective-level features. Previous approaches focused on either cognitive or affective content analysis. In this paper, we propose a novel multi-modality movie genre classification framework, based on situation, dialogue, and metadata, that takes both cognition-based and affect-based features into consideration. The pre-features fusion-based framework takes into account three sources: situation-based features from regular snapshots of a trailer, which include nouns and verbs that provide a useful affect-based mapping to the corresponding genres; dialogue (speech)-based features from the audio; and metadata. Together, these provide the relevant information for cognitive and affect-based video analysis. We also develop the English Movie Trailer Dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres: Action, Romance, Comedy, Horror, and Science Fiction, and perform cross-validation on the standard LMTD-9 dataset to validate the proposed framework. The results, reported as F1 scores, precision, recall, and area under the precision-recall curve, demonstrate that the proposed methodology performs strongly on movie genre classification.

Key Words: Movie Genre Classification, Convolutional Neural Network, English movie trailer dataset, Multimodal data analysis.

1. Introduction

Movies are a great source of amusement for the audience and impact society in numerous ways. Manual genre identification can vary from person to person because it depends on individual taste. Hence, automated movie genre prediction is an active area of research. Movie trailers are becoming a useful source for predicting the genres of a movie, since they provide useful insights into the movie in a very short time. Movie trailers consist of two types of content: cognitive content and affective content.

Cognitive content describes the composition of the events, objects, and persons in a particular video frame of the movie trailer, while affective content describes psychological features such as feelings or emotions in a movie trailer [1]. Examples of cognitive content include a playground, a building, a man, a dog, etc. Examples of affective content are feelings or emotions such as happiness, sadness, anger, etc. Both cognitive and affect-based content provide prominent features for predicting the genres of the movie.

In this paper, we propose a novel multi-modality movie genre classification framework based on situation, dialogue, and metadata, which aims to predict movie genres using the video, audio, and metadata (plot/description) content of movie trailers. Our framework focuses on extracting both cognitive and affective features from the movie trailer. To achieve this, a sentence (generated from situations) composed of relevant nouns and verbs is extracted from each video frame. Nouns give the relevant information about the cognitive content of the trailers, and verbs provide a useful affect-based mapping to the corresponding genres. For example, verbs such as laughing, giggling, and tickling provide an affect-based mapping to the ‘comedy’ genre, while verbs such as attacking, beating, and hitting provide an affect-based mapping to the ‘action’ genre. Along with situations, dialogue and metadata-based features additionally contribute to cognitive and affective content, as they include event descriptions (cognitive content) and psychological features (affective content).
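
To make the verb-to-genre intuition concrete, the following is a minimal sketch in Python. It is not the framework proposed in this paper, which learns such associations with a deep network; it only illustrates the idea with a hand-written lookup table. The lexicon `VERB_TO_GENRE` contains just the six verbs quoted above, and the function name `genre_votes` and the example sentences are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: only the verbs quoted in the text above.
# The actual framework learns these associations; it does not use a table.
VERB_TO_GENRE = {
    "laughing": "comedy", "giggling": "comedy", "tickling": "comedy",
    "attacking": "action", "beating": "action", "hitting": "action",
}


def genre_votes(situation_sentences):
    """Count how many lexicon verbs in the sentences point to each genre."""
    votes = Counter()
    for sentence in situation_sentences:
        for token in sentence.lower().split():
            word = token.strip(".,!?;:'\"")  # drop surrounding punctuation
            genre = VERB_TO_GENRE.get(word)
            if genre is not None:
                votes[genre] += 1
    return votes


# Illustrative situation sentences (assumed, not taken from the dataset).
sentences = [
    "man attacking soldier",
    "woman laughing in kitchen",
    "soldier hitting man",
]
votes = genre_votes(sentences)
print(votes.most_common())  # [('action', 2), ('comedy', 1)]
```

A trailer usually carries more than one genre, so the per-genre counts are kept separate rather than collapsed into a single winner.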

As in a standard machine learning process, the work is carried out in multiple phases. The 1st phase is the dataset generation phase, where we generate the EMTD, which contains 2000 Hollywood movie trailers belonging to 5 popular genres: Action, Romance, Comedy, Horror, and Science Fiction. The 2nd phase involves pre-processing of the video trailers, where all repeated frames are removed and the remaining frames are resized. Sentences containing important nouns and verbs are then extracted from the useful frames. We also prepare audio transcripts of the movie trailers to obtain their dialogues. In the 3rd phase, we design and train the proposed architecture, which extracts and learns the important features from the trailers. Finally, in the 4th phase, the performance of the proposed architecture is evaluated using the Area under the Precision-Recall Curve (AU(PRC)) metric. The following are the significant contributions of our work:

  • We propose a novel EMTD (English Movie Trailer Dataset) containing English-language Hollywood movie trailers belonging to five popular and distinct genres: Action, Romance, Comedy, Horror, and Science Fiction.

  • This work proposes a novel approach to predict movie genres using cognitive and affect-based features. To the best of our knowledge, no previous work has focused on a combination of dialogue, situation, and metadata-based features extracted from movie trailers. Hence, we perform situation-based analysis using nouns and verbs, dialogue-based analysis using speech recognition, and metadata-based analysis using the metadata available with the trailers.

  • The proposed architecture is also evaluated by performing cross-dataset testing on the standard LMTD-9 [2] dataset. The results show that the proposed architecture performs strongly on this dataset as well, demonstrating the superior performance of the framework.
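
The AU(PRC) metric named in the 4th phase can be sketched as follows. This introduction does not state which estimator is used, so the sketch assumes average precision, a common step-wise estimate of the area under the precision-recall curve, computed per genre and then macro-averaged. The function name `average_precision` and all labels and scores below are hypothetical.

```python
def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve for one genre.

    y_true holds 0/1 labels and y_score the predicted scores. Tied scores
    are kept in input order; no special tie handling is attempted.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0  # assumed convention when a genre has no positive trailer
    hits = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            area += hits / rank  # precision at the rank of each positive
    return area / total_pos


# Hypothetical labels and scores for four trailers, one list per genre.
labels = {"action": [1, 0, 1, 0], "comedy": [0, 1, 0, 1]}
scores = {"action": [0.9, 0.8, 0.7, 0.1], "comedy": [0.2, 0.9, 0.1, 0.6]}

per_genre = {g: average_precision(labels[g], scores[g]) for g in labels}
macro = sum(per_genre.values()) / len(per_genre)
print(per_genre, macro)
```

Scoring each genre separately suits a multi-label setting, where a single trailer can belong to several genres at once.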

The remainder of the paper is organized as follows. In Section 2, the past literature on movie genre classification is reviewed, and the motivation behind the proposed work is highlighted. In Section 3, we discuss the proposed EMTD. In Section 4, we provide a detailed description of the proposed architecture. In Section 5, we evaluate the performance of the proposed framework and validate it against two different datasets. The paper is concluded in Section 6.

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.