%0 Conference Proceedings
%T DeepVANet: A Deep End-to-End Network for Multi-modal Emotion Recognition
%+ Australian National University (ANU)
%+ CSIRO Agriculture and Food (CSIRO)
%+ North South University (NSU)
%A Zhang, Yuhao
%A Hossain, Md Zakir
%A Rahman, Shafin
%Z Part 3: Human-Centered AI
%< peer-reviewed
%@ 978-3-030-85612-0
%( Lecture Notes in Computer Science
%B 18th IFIP Conference on Human-Computer Interaction (INTERACT)
%C Bari, Italy
%Y Carmelo Ardito
%Y Rosa Lanzilotti
%Y Alessio Malizia
%Y Helen Petrie
%Y Antonio Piccinno
%Y Giuseppe Desolda
%Y Kori Inkpen
%I Springer International Publishing
%3 Human-Computer Interaction – INTERACT 2021
%V LNCS-12934
%N Part III
%P 227-237
%8 2021-08-30
%D 2021
%R 10.1007/978-3-030-85613-7_16
%K Emotion recognition
%K Deep learning
%K Physiological signals
%Z Computer Science [cs]
%Z Conference papers
%X Human facial expressions and bio-signals (e.g., electroencephalogram and electrocardiogram) play a vital role in emotion recognition. Recent approaches employ both vision-based and bio-sensing data to design multi-modal recognition systems. However, these approaches require extensive domain-specific knowledge and complex pre-processing steps, and fail to take full advantage of the end-to-end nature of deep learning techniques. This paper proposes a deep end-to-end framework, DeepVANet, for multi-modal valence-arousal-based emotion recognition that applies deep learning methods to extract face appearance features and bio-sensing features. We use convolutional long short-term memory (ConvLSTM) techniques in face appearance feature extraction to capture spatial and temporal information from face image sequences. Unlike conventional time- or frequency-domain features (e.g., spectral power and average signal intensity), we use a 1D convolutional neural network (Conv1D) to learn bio-sensing features automatically. In experiments, we evaluate our method on the DEAP and MAHNOB-HCI datasets. Our proposed multi-modal framework outperforms both single-modal methods and state-of-the-art multi-modal approaches, reaching up to 99.22% accuracy.
%G English
%Z TC 13
%2 https://inria.hal.science/hal-04292350/document
%2 https://inria.hal.science/hal-04292350/file/520517_1_En_16_Chapter.pdf
%L hal-04292350
%U https://inria.hal.science/hal-04292350
%~ IFIP-LNCS
%~ IFIP
%~ IFIP-TC13
%~ IFIP-INTERACT
%~ IFIP-LNCS-12934
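
To make the architecture sketched in the abstract concrete, the code below shows one way such a network could be wired up in PyTorch: a ConvLSTM cell over face-image sequences, a Conv1D branch over raw bio-signals, and late fusion into a valence/arousal classifier. This is only an illustrative sketch under assumed layer sizes, channel counts, and fusion strategy; it is not the authors' DeepVANet implementation, and names such as MultiModalEmotionNet are hypothetical.

# Illustrative multi-modal sketch (NOT the authors' DeepVANet code):
# ConvLSTM over face frames + Conv1D over bio-signals, late fusion.
# All layer sizes and channel counts below are assumptions.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: the four gates are computed with one 2D convolution."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c


class MultiModalEmotionNet(nn.Module):
    """Face branch (ConvLSTM) + bio-signal branch (Conv1D), fused for classification."""

    def __init__(self, bio_channels=8, n_classes=2):
        super().__init__()
        self.face_cell = ConvLSTMCell(in_ch=3, hidden_ch=16)
        self.face_pool = nn.AdaptiveAvgPool2d(1)
        # The 1D CNN learns bio-sensing features directly from raw signals.
        self.bio_net = nn.Sequential(
            nn.Conv1d(bio_channels, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(16 + 64, n_classes)

    def forward(self, faces, bio):
        # faces: (B, T, 3, H, W) face-image sequence; bio: (B, C, L) raw signals.
        b, t, _, h_px, w_px = faces.shape
        h = faces.new_zeros(b, 16, h_px, w_px)
        c = faces.new_zeros(b, 16, h_px, w_px)
        for step in range(t):
            h, c = self.face_cell(faces[:, step], (h, c))
        face_feat = self.face_pool(h).flatten(1)   # (B, 16)
        bio_feat = self.bio_net(bio).flatten(1)    # (B, 64)
        return self.classifier(torch.cat([face_feat, bio_feat], dim=1))


if __name__ == "__main__":
    model = MultiModalEmotionNet()
    faces = torch.randn(2, 10, 3, 32, 32)   # 10-frame face sequences
    bio = torch.randn(2, 8, 512)            # 8 bio-signal channels
    print(model(faces, bio).shape)          # torch.Size([2, 2])

Late fusion by concatenating pooled branch features is only one plausible design choice here; the paper's actual fusion and feature dimensions should be taken from the publication itself.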