KRISTINA is getting deeply emotional

Sept. 27, 2016, by Mireia Farrús

With the first prototype of KRISTINA nearing completion, it seems like an appropriate time to talk about the technical side of the project. As the project’s description states, it is “KRISTINA’s overall objective [..] to develop a human-like socially competent and communicative agent that […] run[s] on mobile communication devices”. To create an agent with human-like communication skills, we need to analyze the user’s input not only with respect to the semantics of their conversational contributions but also with respect to their overall affective state, so that the agent can react accordingly. This goal is as challenging from a scientific point of view as it is from an engineering perspective. To classify the user’s emotion as accurately as possible, we need to extract cues from multiple modalities, such as facial expressions, voice and gestures, and fuse them into one overall estimate. These analyses require a chain of processing steps that can not only be computationally demanding but are also developed by various researchers using different tools. Therefore, a number of questions arise when thinking about the implementation:

“What can we use as a foundation to combine the solutions of multiple partners into one piece of software?”
“How can we ensure that the system is easily adaptable to future research?”
“How can we realize all this on a mobile device?”

While these questions represent merely a fraction of the challenges we need to overcome during the development of KRISTINA, they may be the most decisive ones to answer. This article gives an insight into our approach to tackling them.
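To make the idea of fusing several modalities into one estimate more concrete, here is a minimal Python sketch. This article does not describe which fusion method KRISTINA uses, so the weighted average of per-modality class probabilities, the `fuse` function, the weights and the example values are all assumptions chosen for illustration.

```python
from typing import Dict

Probabilities = Dict[str, float]


def fuse(predictions: Dict[str, Probabilities],
         weights: Dict[str, float]) -> Probabilities:
    """Weighted average of per-modality class probabilities (assumed scheme)."""
    total = sum(weights[m] for m in predictions)
    classes = next(iter(predictions.values())).keys()
    return {
        c: sum(weights[m] * predictions[m][c] for m in predictions) / total
        for c in classes
    }


# Hypothetical outputs of three independent classifiers.
predictions = {
    "voice":   {"positive": 0.25, "negative": 0.75},
    "face":    {"positive": 0.75, "negative": 0.25},
    "gesture": {"positive": 0.5, "negative": 0.5},
}
# Hypothetical trust in each modality.
weights = {"voice": 1.0, "face": 2.0, "gesture": 1.0}

fused = fuse(predictions, weights)
label = max(fused, key=fused.get)
print(fused, label)  # {'positive': 0.5625, 'negative': 0.4375} positive
```

Combining the classifier outputs rather than the raw signals keeps each modality's analysis independent, which suits a setting where the individual analyses come from different partners.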

The cornerstone of the whole implementation is the Social Signal Interpretation (SSI) framework [1]. SSI offers tools to record, analyze and recognize human behavior, such as gestures, facial expressions, head nods, and emotional speech, in real time. Following a patch-based design, pipelines are set up from autonomous components and allow the parallel and synchronized processing of sensor data from multiple input devices. Within the KRISTINA project, SSI is the central instance used to train machine learning models and to integrate them into the technical environment, where they analyze the user’s video and audio input with respect to displayed emotions. Furthermore, it serves as a data distribution center that establishes connections to other modules, such as speech-to-text or language analysis.
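The patch-based idea can be illustrated with a small Python sketch. It does not use SSI's API: the `Component` and `Pipeline` classes and the toy feature functions are hypothetical names invented for this example. For simplicity it processes the streams one after another rather than in parallel, and only shows how independent components can be chained and fed with windows that cover the same time span.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Window = List[float]


@dataclass
class Component:
    """Hypothetical stand-in for one autonomous processing step."""
    name: str
    func: Callable[[Window], Window]

    def process(self, window: Window) -> Window:
        return self.func(window)


@dataclass
class Pipeline:
    """Chains components per stream and steps all streams window by window."""
    chains: Dict[str, List[Component]] = field(default_factory=dict)

    def add(self, stream: str, component: Component) -> None:
        self.chains.setdefault(stream, []).append(component)

    def step(self, windows: Dict[str, Window]) -> Dict[str, Window]:
        # Every stream receives the window for the same time span,
        # so the outputs stay aligned with each other.
        out = {}
        for stream, window in windows.items():
            for component in self.chains.get(stream, []):
                window = component.process(window)
            out[stream] = window
        return out


def energy(window: Window) -> Window:
    """Mean squared amplitude, a toy audio feature."""
    return [sum(x * x for x in window) / len(window)]


def mean(window: Window) -> Window:
    """Average value, a toy video feature."""
    return [sum(window) / len(window)]


pipeline = Pipeline()
pipeline.add("audio", Component("energy", energy))
pipeline.add("video", Component("mean", mean))

features = pipeline.step({"audio": [0.5, -0.5, 0.5, -0.5], "video": [0.0, 1.0]})
print(features)  # {'audio': [0.25], 'video': [0.5]}
```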

While SSI has been (and still is) used in various EU projects, such as Tardis, Ilhaire, CEEDS and Aria Valuspa, we introduced two novelties to the framework to adapt it to the specific needs of KRISTINA.

The first novelty concerns the infrastructure: KRISTINA uses a cloud-based architecture to stream video and audio data from the user to a remote server, where the data can be analyzed. This reduces the workload on the user’s device, which is crucial for bringing computationally heavy tasks like emotion analysis or speech-to-text to a mobile device. To enable performant data streaming from the client to the server, we use WebRTC. This allows us to capture audio and video directly in the browser of almost any mobile device, as long as the browser supports the technology. On the server side, the data is received by a Node.js server, which forwards it directly to SSI with the help of FFmpeg. By using web technology in this way, we can move the signal processing part of KRISTINA away from the user’s device to a much more powerful server while keeping the whole system real-time capable.
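The forwarding step might look roughly like the following sketch. The article does not state the actual FFmpeg settings or how SSI receives the stream, so the target address, the sample rate and the choice of raw 16-bit PCM over UDP are assumptions for illustration. The sketch is written in Python rather than Node.js and only assembles the command; it does not start FFmpeg.

```python
from typing import List


def build_ffmpeg_audio_command(
    target_url: str = "udp://127.0.0.1:9000",  # assumed receiver address
    sample_rate: int = 16000,                  # assumed sample rate in Hz
) -> List[str]:
    """Build an FFmpeg call that decodes a media stream from stdin
    and sends mono 16-bit PCM audio to a network target."""
    return [
        "ffmpeg",
        "-i", "pipe:0",           # read the incoming stream from standard input
        "-vn",                    # drop video, keep audio only
        "-acodec", "pcm_s16le",   # signed 16-bit little-endian PCM
        "-ar", str(sample_rate),  # resample to the requested rate
        "-ac", "1",               # downmix to a single channel
        "-f", "s16le",            # raw samples without a container
        target_url,
    ]


command = build_ffmpeg_audio_command()
print(" ".join(command))
```

A server process could start this command as a child process and write each received media chunk to its standard input, so that decoding happens outside the web server itself.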

The second novelty introduced to SSI concerns the adaptability of the system to future research. As stated before, the overall affective state of the user is inferred by analyzing voice, gestures and facial expressions. Each of these analyses is implemented as a separate component within the framework. While this ensures good performance and reliable communication between the components, it also means that the available classification methods are restricted to those offered by the framework. To stay flexible in the future, when the implementation has to keep up with the newest findings from the respective research teams, we extended SSI’s classification algorithms with deep neural networks.

The popularity of deep neural networks has increased strongly over the last few years due to the successful application of deep-learning algorithms to various machine learning problems. At the same time, the number of software libraries that aim to hide the complexity of implementing those algorithms from researchers has grown. As a result, a multitude of libraries such as Theano, Keras or TensorFlow are now available to choose from. While most of them rely on a C or C++ core for performance reasons, their main advantage is ease of use, which is usually achieved by providing interfaces to more intuitive and flexible programming languages like Python.
To be able to use such libraries to design, train and deploy deep neural networks from within SSI, the framework has been extended to expose an interface to the Python programming language. This approach enables the seamless integration of any Python code or library into the framework while keeping the overhead as small as possible so as not to hurt performance in online data processing. A neural network can be integrated into the analysis pipeline simply by implementing the interfaces that SSI expects in the same Python script in which the model is built.
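What such a script could look like is sketched below. The hook names `train` and `forward`, their signatures and the `build_model` helper are assumptions made for illustration and are not taken from SSI's documentation. In practice the model would typically be built with a library such as Keras; here it is replaced by a tiny softmax classifier in plain Python to keep the example self-contained.

```python
import math
from typing import Dict, List

Sample = List[float]
Model = Dict[str, list]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(n_features: int, n_classes: int) -> Model:
    """Create a single-layer softmax classifier with zero-initialized weights."""
    return {
        "weights": [[0.0] * n_features for _ in range(n_classes)],
        "bias": [0.0] * n_classes,
    }


def forward(model: Model, sample: Sample) -> List[float]:
    """Hypothetical hook: return class probabilities for one feature vector."""
    scores = [
        sum(w * x for w, x in zip(row, sample)) + b
        for row, b in zip(model["weights"], model["bias"])
    ]
    return softmax(scores)


def train(model: Model, samples: List[Sample], labels: List[int],
          epochs: int = 200, rate: float = 0.5) -> None:
    """Hypothetical hook: fit the model with plain stochastic gradient descent."""
    for _ in range(epochs):
        for sample, label in zip(samples, labels):
            probs = forward(model, sample)
            for c, p in enumerate(probs):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                grad = p - (1.0 if c == label else 0.0)
                model["bias"][c] -= rate * grad
                for i, x in enumerate(sample):
                    model["weights"][c][i] -= rate * grad * x


# Toy two-dimensional features; class 0 = "calm", class 1 = "aroused".
samples = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels = [0, 0, 1, 1]

model = build_model(n_features=2, n_classes=2)
train(model, samples, labels)
probs = forward(model, [0.85, 0.9])
print(probs.index(max(probs)))
```

Keeping the model definition and the framework-facing hooks in one script means that swapping in a different network only requires changing `build_model`, `train` and `forward`, while the rest of the pipeline stays untouched.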


[1] J. Wagner, F. Lingenfelser, T. Baur, I. Damian, F. Kistler and E. André, "The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time," in Proceedings of the 21st ACM international conference on Multimedia, 2013.
