12th IEEE Conference on Automatic Face and Gesture Recognition - FG 2017

May 14, 2017, 1:18 p.m.

The IEEE conference series on Automatic Face and Gesture Recognition is the premier international forum for research in image- and video-based face, gesture, and body movement recognition. Our consortium of partners works continuously on integrating these components to build a communicative agent with social and cultural capabilities. Key areas include advances in fundamental computer vision, pattern recognition, computer graphics, and machine learning techniques relevant to face, gesture, and body-action algorithms, as well as the analysis of specific applications.

On May 30, two papers will be presented at the 12th IEEE Conference on Automatic Face and Gesture Recognition in Washington, DC. Our researchers Federico Sukno and Oriol Martinez work on computer vision and emotion recognition, and the following papers demonstrate the international relevance of their work:

  • Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database. A. Fernandez-Lopez, O. Martinez and F.M. Sukno. 12th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 2017.

Speech is the most common communication method between humans, and its perception involves both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signal, although video can provide information that is complementary to the audio. Exploiting this visual information, however, has proven challenging. On the one hand, researchers have reported that the mapping between phonemes and visemes (visual units) is many-to-one, because some phonemes are visually similar and thus indistinguishable from one another. On the other hand, it is known that some people, e.g. deaf people, are very good lip-readers. We study the limits of visual-only speech recognition under controlled conditions. To this end, we designed a new database in which the speakers are aware of being lip-read and aim to facilitate lip-reading.
In the literature, there are discrepancies on whether hearing-impaired people are better lip-readers than normal-hearing people. We therefore analyse whether there are differences between the lip-reading abilities of 9 hearing-impaired and 15 normal-hearing participants. Finally, human abilities are compared with the performance of a visual automatic speech recognition system. In our tests, hearing-impaired participants outperformed the normal-hearing participants, but without reaching statistical significance. Human observers were able to decode 44% of the spoken message; in contrast, the visual-only automatic system achieved a word recognition rate of 20%. However, when the comparison is repeated in terms of phonemes, both obtained very similar recognition rates, just above 50%. This suggests that the gap between human lip-reading and automatic speechreading may be related more to the use of context than to the ability to interpret mouth appearance.
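The many-to-one phoneme-to-viseme mapping can be illustrated with a small sketch. The viseme classes and the phoneme grouping below are hypothetical, chosen only to show why visually similar phonemes make visual-only decoding ambiguous; they are not the actual mapping used in the paper.

```python
# Hypothetical many-to-one phoneme-to-viseme mapping: visually similar
# phonemes collapse onto the same visual unit, so a viseme sequence
# cannot be decoded back to a unique phoneme sequence.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",   # lips pressed together
    "f": "labiodental", "v": "labiodental",              # lower lip to teeth
    "t": "alveolar", "d": "alveolar", "n": "alveolar",   # tongue behind teeth
}

def visemes(phonemes):
    """Project a phoneme sequence onto its visual (viseme) sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# Under this mapping, the consonants of "pat" and "bad" look identical
# on the lips, even though they are acoustically distinct:
assert visemes(["p", "t"]) == visemes(["b", "d"])
```

This ambiguity is exactly what limits a visual-only recognizer at the word level, and why context (which of the visually identical candidates fits the sentence) matters so much for human lip-readers.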


  • Fusion of Valence and Arousal Annotations through Dynamic Subjective Ordinal Modelling. A. Ruiz, O. Martinez, X. Binefa and F.M. Sukno. 12th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 2017.

An essential issue when training and validating computer vision systems for affect analysis is how to obtain reliable ground-truth labels from a pool of subjective annotations. In this paper, we address this problem when labels are given in an ordinal scale and annotated items are structured as temporal sequences. This problem is of special importance in affective computing, where collected data is typically formed by videos of human interactions annotated according to the Valence and Arousal (V-A) dimensions. Moreover, recent works have shown that inter-observer agreement of V-A annotations can be considerably improved if these are given in a discrete ordinal scale. In this context, we propose a novel framework which explicitly introduces ordinal constraints to model the subjective perception of annotators. We also incorporate dynamic information to take into account temporal correlations between ground-truth labels. In our experiments over synthetic and real data with V-A annotations, we show that the proposed method outperforms alternative approaches which do not take into account either the ordinal structure of labels or their temporal correlation.
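To make the problem concrete, here is a deliberately naive fusion baseline, not the dynamic subjective ordinal model proposed in the paper: the per-frame median respects the ordinal scale (unlike the mean, it always returns an observed label level), and a small temporal median filter stands in for the temporal-correlation prior. All data values are invented for illustration.

```python
# Naive baseline for fusing per-frame ordinal annotations from several
# annotators (NOT the paper's method): per-frame median + temporal median.
from statistics import median_low

def fuse_ordinal(annotations):
    """annotations: list of per-annotator label sequences of equal length."""
    return [median_low(frame) for frame in zip(*annotations)]

def smooth(seq, radius=1):
    """Temporal median filter over a window of 2*radius + 1 frames."""
    return [median_low(seq[max(0, t - radius): t + radius + 1])
            for t in range(len(seq))]

# Three annotators rate valence per frame on a 5-level ordinal scale {-2..2}:
ann = [[-1, 0, 0, 1, 2],
       [ 0, 0, 1, 1, 2],
       [-1, 1, 0, 2, 1]]
fused = smooth(fuse_ordinal(ann))  # one ground-truth label per frame
```

A baseline like this ignores annotator-specific biases and models time only crudely, which is precisely the gap that the proposed ordinal, dynamic formulation is designed to close.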

Many different topics will be discussed during the upcoming days, such as:


  • Face Recognition, Analysis & Synthesis: tracking/detection, recognition, expression analysis & synthesis, 3D analysis & synthesis, lip reading
  • Gesture Recognition, Analysis & Synthesis: gesture interpretation, head motion tracking, arm/limb analysis & synthesis, human body tracking, vision based human interfaces, life-like avatars
  • Psychological and Behavioral Analysis: facial behavior analysis, gesture recognition and variation, body gait movement analysis, measurement
  • Technologies and Applications: biometric applications, sensors, surveillance applications, feature selection and analysis, fusion of multiple modalities, human computer/robot interaction, socially-aware computing, reality mining
  • Body Action and Activity Recognition: body motion analysis, body movement analysis, gait recognition, human action and activity recognition

See you in Washington, DC!
Follow the presentation of our papers on our social media: Twitter and Facebook
