The CALLAS Shelf provides a dynamic pool of multimodal interface technologies: software components either selected for their proven efficiency and robustness or newly developed within CALLAS, so as to guarantee consistent performance across many contexts and scenarios.
These components deal with:
- processing and interpreting signals in terms of emotional and affective categories
- rendering emotions through music, emotional language and virtual humanoid representations
- integration, mapping and fusion for multimodal emotion recognition
Suggested reading (see Public deliverables):
- Identification and Selection of Modules: update October 2007
- Shelf Selection of new models: update October 2008
- Shelf Components 1st Release: update October 2007
- Emotional Natural Language generator: update October 2007
- Specification for Model of Awareness: update October 2008
- Integrated Model of expressive and attentive capabilities: update October 2008
- Affective Music Synthesis: update October 2008
- Final Report on Multimodal Components: update April 2010
- Final Report on ECAs for affective output: update April 2010
CALLAS components processing signals from microphones, cameras, haptic devices, mobile phones, the Wiimote, and audio and video streams:
- Multikeyword Spotting: a component that recognizes when one of a pre-defined set of utterances occurs in speech, useful for selecting different paths in an application or for evaluating the user's feelings, indirectly driving application changes. It is speaker-independent and can run in automatic or push-to-talk mode. The list of words to be recognized, as well as the language, can be changed at runtime.
- Real-time emotion recognition from speech: a framework for building an emotion classifier and recognizing emotions in real time. It extracts from the speech signal a vector of emotion-relevant acoustic features (e.g. derived from pitch, energy, voice quality, pauses and spectral information) and then uses a statistical classifier, trained on examples, to assign emotion labels (a minimal sketch of this feature-plus-classifier pattern follows this list).
- Emotional text analyser: a component that uses linguistic information relevant to lexical affect sensing to recognize emotions in text through statistical or semantic analysis (see the lexicon-based sketch after this list).
- Audio Feature Extraction: taking input from live audio, it classifies audio streams into different sound classes such as speech, music, silence, constant and variable sounds, clapping, whistling and applause, and outputs the corresponding audio class for each audio frame.
- Video Feature Extraction: extracting faces from a video sequence or live camera feed to derive information about emotional state, content and context; it keeps track of how many people are looking towards the camera, yielding useful cues about the audience's level of interest. The component also acts as a video player, playing video files and capturing live feed from a camera.
- Human Glove Wearable Interface for Motion Capture: based on a data glove device as the sensing unit, the component captures motion data from sensors to record full-body motion; it is integrated with an inertial platform (consisting of accelerometers, gyroscopes and magnetometers), making it suitable for emotion extraction.
- Video-Based Gesture Expressivity Features Extraction: a video-based component that detects and tracks the user's hands to extract and transmit expressivity feature values such as overall activation, spatial extent, temporal extent, fluidity and power.
- WiiGLE: a component that classifies hand movements in 3D space by analysing acceleration data captured from a Nintendo Wiimote controller. It relies on a corpus of arbitrary gestures used to train classifiers, which are then applied to online gesture recognition (see the gesture-classification sketch after this list).
- Gesture recognition from mobile phones: a component that uses a mobile phone with accelerometers as a sensor, mapping types of movement defined by expressivity parameters (e.g. graceful/fast tempo) to different emotions.
- Gaze detection and Head Pose estimation: a component that estimates the head movements (yaw, pitch, roll) and gaze direction of a user in front of a computer monitor, deriving information about the user's state: attentive, distracted or nervous.
- Facial feature detection: detecting and tracking different facial features (such as eye centres, eye corners, and upper and lower eyelids) based on facial geometry and prototypes of natural human motion.
- Facial Expression Recognition: recognizing facial expressions in real time by localizing and tracking facial feature movements, based on the appearance of a person's expression while interacting with a camera, and providing emotion recognition feedback in terms of dimensional or Ekmanian emotions.
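To make the feature-plus-classifier pattern behind the speech emotion recognizer concrete, the following minimal Python sketch extracts a tiny acoustic feature vector (energy and pause statistics only) and trains a statistical classifier on placeholder data. It assumes numpy and scikit-learn and is not the CALLAS implementation.

```python
# Minimal sketch of "acoustic features + statistical classifier" for speech emotion
# recognition. NOT the CALLAS component: the feature set is deliberately tiny and
# the training data below is random placeholder data.
import numpy as np
from sklearn.svm import SVC

def acoustic_features(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return a small emotion-relevant feature vector: energy and pause statistics."""
    frame = sr // 100                                  # 10 ms frames
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    energy = np.sqrt((frames ** 2).mean(axis=1))       # short-time energy per frame
    pauses = (energy < 0.1 * energy.max()).mean()      # fraction of low-energy frames
    return np.array([energy.mean(), energy.std(), pauses])

# Placeholder corpus: 40 random "utterances" with invented binary labels.
rng = np.random.default_rng(0)
X = np.stack([acoustic_features(rng.normal(scale=s, size=16000))
              for s in rng.uniform(0.2, 1.0, size=40)])
y = (X[:, 0] > np.median(X[:, 0])).astype(int)         # 0 = "calm", 1 = "aroused" (toy labels)

clf = SVC(probability=True).fit(X, y)                  # statistical classifier trained on examples
print(clf.predict_proba(acoustic_features(rng.normal(size=16000))[None, :]))
```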
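The emotional text analyser's lexical affect sensing can be illustrated in the same hedged spirit: the sketch below looks words up in a small hand-made valence/arousal lexicon and averages their scores. The lexicon entries are invented examples, not the resource or analysis method used by the component.

```python
# Minimal sketch of lexical affect sensing on text: words are looked up in a
# small valence/arousal lexicon and the sentence score is their average.
AFFECT_LEXICON = {            # word -> (valence, arousal), both in [-1, 1]; invented entries
    "wonderful": (0.9, 0.5),
    "happy":     (0.8, 0.4),
    "boring":    (-0.5, -0.6),
    "terrible":  (-0.9, 0.6),
}

def sentence_affect(text: str):
    hits = [AFFECT_LEXICON[w] for w in text.lower().split() if w in AFFECT_LEXICON]
    if not hits:
        return (0.0, 0.0)                               # neutral if no affective words found
    valence = sum(v for v, _ in hits) / len(hits)
    arousal = sum(a for _, a in hits) / len(hits)
    return (valence, arousal)

print(sentence_affect("What a wonderful and happy surprise"))
print(sentence_affect("That talk was terrible"))
```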
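Similarly, the WiiGLE approach of training classifiers on a recorded gesture corpus and then recognizing gestures online can be sketched as follows; the feature choices, the k-nearest-neighbour classifier and the fake gesture data are assumptions for illustration only.

```python
# Minimal sketch of WiiGLE-style gesture classification: summarize a 3-axis
# acceleration trace into a fixed-length feature vector and classify it with a
# model trained on a small gesture corpus. Not the WiiGLE implementation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def accel_features(trace: np.ndarray) -> np.ndarray:
    """trace: (n_samples, 3) acceleration; returns per-axis mean/std plus mean magnitude."""
    magnitude = np.linalg.norm(trace, axis=1)
    return np.concatenate([trace.mean(axis=0), trace.std(axis=0), [magnitude.mean()]])

rng = np.random.default_rng(1)

def fake_gesture(kind: str) -> np.ndarray:              # placeholder recordings of two gesture types
    amp = 0.3 if kind == "circle" else 1.5               # the "shake" gesture is more energetic
    return rng.normal(scale=amp, size=(100, 3))

X = np.stack([accel_features(fake_gesture(k)) for k in ["circle", "shake"] * 20])
y = ["circle", "shake"] * 20

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)    # trained offline on the gesture corpus
print(model.predict(accel_features(fake_gesture("shake"))[None, :]))  # online recognition
```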
CALLAS components rendering emotions in terms of speech, music, laughter and animated ECAs:
- Emotional Natural Language Generator: understanding the emotional state of a speaker from "what" is said and "how" it is linguistically expressed. It is based on an annotated corpus of sentences presenting typical expressions used in conversation.
- Affective Music Synthesis: rendering the user's emotive state through real-time generation of affective music. Key characteristics of the music are altered in response to the user's changing mood, psychologically correlated and expressed in the PAD model (an illustrative PAD-to-music mapping follows this list).
- Acoustic Awareness: analysing and reacting appropriately to laughter, allowing an ECA to join its conversational partners' laughter.
- Emotional Attentive ECA: interacting with a user through a rich palette of verbal and nonverbal behaviours of a real-time 3D female agent. The communicative intentions of the listener are rendered by talking while simultaneously showing facial expressions, gestures, gaze and head movements.
- Augmented Reality Output Component: an end-user application for Augmented Reality visualization that allows scripting and fast assembly of AR interactions.
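How a PAD mood estimate can drive musical characteristics is illustrated by the sketch below; the parameter ranges and mapping rules are assumptions chosen for readability, not the rules used by the Affective Music Synthesis component.

```python
# Illustrative mapping from a PAD (Pleasure-Arousal-Dominance) mood estimate to
# musical control parameters. Ranges and rules are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class MusicParams:
    tempo_bpm: float   # faster for higher arousal
    mode: str          # major for positive valence, minor for negative
    loudness: float    # 0..1, louder for more dominant moods

def pad_to_music(pleasure: float, arousal: float, dominance: float) -> MusicParams:
    """All PAD inputs are expected in [-1, 1]."""
    tempo = 90 + 50 * arousal                  # 40..140 bpm across the arousal range
    mode = "major" if pleasure >= 0 else "minor"
    loudness = 0.5 + 0.4 * dominance           # 0.1..0.9
    return MusicParams(tempo_bpm=tempo, mode=mode, loudness=loudness)

# A relaxed, positive, slightly submissive mood -> slow, major, fairly soft music.
print(pad_to_music(pleasure=0.6, arousal=-0.4, dominance=-0.2))
```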
CALLAS components devoted to multimodal emotion recognition:
- Low Level Multimodal Fusion: applying machine learning to the output of the feature-extraction components to provide unimodal emotion recognition, and combining features from the individual modalities to support early fusion and multimodal emotion recognition (see the early-fusion sketch after this list).
- Smart sensor integration: featuring integration of single or multiple sensors into multimedia applications. It allows a developer to quickly turn standard sensors, such as a microphone, camera or Wiimote, into "smart" sensors that present information in a form meeting the application's requirements as effectively as possible.
- Ad hoc multimodal semantic fusion components: combining affective results from individual components into a dimensional model based on PAD (Pleasure-Arousal-Dominance) for an overall affective representation of user interactions (see the PAD fusion sketch below).
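As an illustration of early (feature-level) fusion of the kind performed by the Low Level Multimodal Fusion component, the sketch below concatenates feature vectors from two modalities before training a single classifier; the features, labels and classifier choice are placeholders, not CALLAS output.

```python
# Minimal sketch of early (feature-level) fusion: feature vectors from individual
# modalities are concatenated into one vector before a single classifier is trained.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 60
audio_features = rng.normal(size=(n, 4))      # e.g. energy/pitch statistics per utterance
video_features = rng.normal(size=(n, 6))      # e.g. facial-expression descriptors per clip
labels = rng.integers(0, 2, size=n)           # placeholder emotion labels

fused = np.hstack([audio_features, video_features])    # early fusion: one joint feature space
clf = LogisticRegression().fit(fused, labels)
print(clf.predict(np.hstack([rng.normal(size=(1, 4)), rng.normal(size=(1, 6))])))
```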
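Semantic fusion into PAD can likewise be sketched as a confidence-weighted combination of per-component estimates; the weighting scheme below is an assumption for illustration, not the ad hoc fusion rule used in CALLAS.

```python
# Illustrative semantic fusion into the PAD space: each component reports a
# (pleasure, arousal, dominance) estimate plus a confidence, and the fused state
# is their confidence-weighted average.
import numpy as np

def fuse_pad(estimates):
    """estimates: list of ((pleasure, arousal, dominance), confidence) pairs."""
    pads = np.array([p for p, _ in estimates], dtype=float)
    weights = np.array([c for _, c in estimates], dtype=float)
    return (weights[:, None] * pads).sum(axis=0) / weights.sum()

speech  = ((0.2, 0.8, 0.1), 0.9)    # aroused-sounding speech, high confidence
face    = ((0.5, 0.3, 0.0), 0.6)    # mildly positive facial expression, medium confidence
gesture = ((-0.1, 0.6, 0.4), 0.4)   # energetic gesture, lower confidence

print(fuse_pad([speech, face, gesture]))   # one overall affective state in PAD space
```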