Angelica Lim – Machine Learning for HRI: Bridging the Gap between Action and Perception

This post was written by Sage Hughes, an undergraduate researcher at the Digital Democracies Institute.

On November 23, Dr. Angelica Lim presented to the Digital Democracies Institute for the final installment of the 2022 Fall Speaker Series. Dr. Lim is an Assistant Professor of Professional Practice and the director of the ROSIE Lab (short for ‘Robots with Social Intelligence and Empathy’) at Simon Fraser University. She holds a Ph.D. and M.Sc. in Computer Science from Kyoto University, Japan, specializing in artificial intelligence applied to robotics, and received her B.Sc. in Computing Science from Simon Fraser University, Canada.

Dr. Lim’s presentation, “Social Signals in the Wild”, described her research on multimodal machine learning for human-robot interaction (HRI) and concluded with a discussion of how culture relates to this field.

Professional Career: “Pepper” 

Beginning in 2012, Lim worked on developing Pepper: a comedian-programmed robot built to inform, converse, and entertain across a variety of settings. This social robot was first deployed in storefronts but has since been used in settings ranging from business and research to home environments. To Lim’s own surprise, she even spotted Pepper in Vancouver’s Coal Harbour, working as an aid for tours. In any case, Pepper is a primer for Lim’s subsequent work on how humour, gestures, and expression become integrated into human-robot interactions.

Interfaces and Multimodal Machine Learning

Dr. Angelica Lim situates the ‘interface’ as a central concept in her research. She asks, “How can we make the interface less about adapting to it, and more about it adapting to us?” 

What do human-machine interfaces look like today, and how might they look 20 or 30 years from now? Will flat OLED screens, auditory displays, or virtual reality systems work their way into that vision of the future?

Questions like these are at the crux of Dr. Lim’s research. Though not limited to robotics, this inquiry aligns with her use of multimodal machine learning for human-robot interaction. A robot that communicates seamlessly requires engagement with cognition, behaviour, and emotion. For this to occur, aspects of human interaction are modelled in the mechanical and computational processes of a given robot interface.

Often, we are required to adapt to the interfaces of ‘smart’ technology. Dr. Lim illustrates this with voice assistants. If you’ve ever used one (e.g., Amazon Alexa, Google Home, Apple’s Siri), you may have experienced how communication can get ‘boxed in’: before you finish a thought, you’re cut off by a hasty beep or other auditory response. There is only a small window of time before the machine interrupts, signalling aurally that it is done listening and is now working to decode a fragment of the whole spoken message. In other words, instead of the machine modelling us, we model the machine to get the desired response from our interaction.
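To make that ‘boxed in’ feeling concrete, here is a toy sketch of the kind of end-of-speech logic a voice assistant might use: once the audio energy stays below a threshold for a short timeout, the device decides the utterance is over. The threshold, timeout, and frame size are illustrative assumptions, not any vendor’s actual parameters.

```python
# Toy end-of-speech detector: listening stops once audio energy stays below a
# threshold for a fixed timeout. All values here are illustrative assumptions,
# not any real assistant's parameters.
import numpy as np

ENERGY_THRESHOLD = 0.01   # below this, a frame counts as "silence"
SILENCE_TIMEOUT_S = 0.7   # pause longer than this and the device stops listening
FRAME_S = 0.02            # 20 ms analysis frames

def end_of_utterance(frames: list[np.ndarray]) -> int | None:
    """Return the frame index where listening would stop, or None."""
    silent_run = 0.0
    for i, frame in enumerate(frames):
        energy = float(np.mean(frame ** 2))
        silent_run = silent_run + FRAME_S if energy < ENERGY_THRESHOLD else 0.0
        if silent_run >= SILENCE_TIMEOUT_S:
            return i  # a mid-sentence "um ... pause" would get cut off here
    return None
```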

Yet a mid-sentence pause is a quotidian part of oral communication. In everyday conversation, it’s normal to take a moment to think, glance up or down, or say “um” and “uh” in between. In a human-to-human context, we generally do so without being interrupted or experiencing long processing gaps. Reducing these mismatches between machine interfaces and human communication is one area where, Lim says, multimodal machine learning has been employed.

Eye-contact example from a study by Sean Andrist et al. Cognition refers to where subjects looked while thinking (represented by blue). Intimacy regulation is where one might look to keep an appropriate amount of eye contact, and floor management refers to “holding the floor”, signalling whether it is time to speak.

Drawing on Sean Andrist et al.’s observations of conversational gaze, these eye-contact modes were integrated into the tracking and response behaviours of a robot. Conversations between humans and the robot showed fewer interruptions and turn gaps, and the robot was perceived as a more thoughtful conversation partner overall.
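As a rough illustration of how such gaze findings might be turned into robot behaviour, the sketch below maps the three conversational functions above onto simple gaze-aversion actions. The states, targets, and durations are assumptions made for illustration, not values from Andrist et al.’s system.

```python
# Illustrative sketch: mapping conversational gaze functions (cognition,
# intimacy regulation, floor management) onto a robot's gaze controller.
# Targets and durations are made-up values for illustration only.
from dataclasses import dataclass
import random

@dataclass
class GazeAction:
    target: str      # "partner", "up", "down", or "side"
    duration_s: float

def choose_gaze(state: str) -> GazeAction:
    """Pick a gaze-aversion behaviour based on its conversational function."""
    if state == "cognition":
        # Thinking: briefly look away (often up or to the side) before answering.
        return GazeAction(random.choice(["up", "side"]), 1.5)
    if state == "intimacy_regulation":
        # Avoid staring: a short glance away keeps eye contact comfortable.
        return GazeAction("down", 0.8)
    if state == "floor_management":
        # Holding the floor: look away while speaking, return gaze to yield the turn.
        return GazeAction("side", 1.0)
    # Default: maintain mutual gaze with the partner.
    return GazeAction("partner", 2.0)

print(choose_gaze("cognition"))
```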

At the ROSIE Lab, these themes trace through three main areas:

  • Building robots that are useful and interact naturally and seamlessly with humans.
  • Developing smart AI software to help robots understand what humans do, think, feel and mean.
  • Creating new AI algorithms and implementing models of the human mind based on neuroscience, psychology, and developmental science. 

Expressive Multimodal Systems: Emotional Gestures and Musical Robots

During her time as a master’s student at Kyoto University, Dr. Lim developed adaptive, theremin-playing robots, including the NAO Thereminist (2011) and the HRP-2 Thereminist (2010-2011): bots designed to recognize gestures, track beats, adapt to cues from musical performances, and integrate aspects of musical expression into their playing. By modelling expression, gestures, and these formal components of musical performance, the bots played in a manner closer to that of musically trained humans.

Part of this earlier research also concerns expressive components: how musical expression might be modelled and integrated into robot gestures. Speed, Intensity, Regularity, and Extent (SIRE) are features that, Lim says, have been found to be important for emotion in research on music, voices, and movement. This led to the development of the SIRE model, which was used to explore how the dynamics of emotion might be modelled across different domains, including music and speech.

Components of the SIRE model from Dr. Angelica Lim’s presentation.
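To make the idea concrete, here is a toy sketch of extracting SIRE-style features from a one-dimensional expressive signal, such as a hand trajectory or a loudness contour. The specific formulas are illustrative assumptions and not the parameterization used in Dr. Lim’s SIRE model.

```python
# Toy sketch of SIRE-style features (Speed, Intensity, Regularity, Extent)
# computed from a 1-D signal. Definitions here are illustrative assumptions,
# not the parameterization of the actual SIRE model.
import numpy as np

def sire_features(signal: np.ndarray, fs: float) -> dict:
    """Return rough Speed/Intensity/Regularity/Extent estimates.

    signal: samples of one expressive channel (e.g., hand height, loudness).
    fs: sampling rate in Hz.
    """
    velocity = np.diff(signal) * fs
    speed = float(np.mean(np.abs(velocity)))          # how fast it moves
    intensity = float(np.sqrt(np.mean(signal ** 2)))  # overall energy (RMS)
    # Regularity: a steadier signal has less variation in velocity.
    regularity = float(1.0 / (1.0 + np.std(velocity)))
    extent = float(np.max(signal) - np.min(signal))   # how large the excursion is
    return {"speed": speed, "intensity": intensity,
            "regularity": regularity, "extent": extent}

# Example: a slow, smooth wave versus a fast, jittery one.
t = np.linspace(0, 2, 200)
calm = 0.2 * np.sin(2 * np.pi * 0.5 * t)
excited = np.sin(2 * np.pi * 3 * t) + 0.3 * np.random.randn(t.size)
print(sire_features(calm, fs=100), sire_features(excited, fs=100))
```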

When empathetic Pepper robots were demonstrated, responses suggested that the robots had sensed emotion. Still, Lim says it’s important to note that the empathetic response model was never fully deployed, as she puts it, “in the wild”.

This is because, outside the lab, the expressions to be detected were unknown. Further, the emotion recognition model worked for acted expressions, but the intent or meaning behind a facial expression varies. The smile of a person who feels happy is different from the conventional ‘masking’ smile, but a robot may not always be able to interpret that. Another example Lim finds useful is the ‘neutral’ expression: facial neutrality could stand in for a negative baseline emotion like anger, sadness, or even fear.

Emotion Recognition vs. Social Signal Recognition

Tensions around the extent to which emotion recognition and representation can capture intent are constantly navigated in the world of robotics. Dr. Lim notes that facial emotion recognition comes with many challenges:

First, basic terms for emotions (sadness, anger, fear, etc.) are often too simplistic to capture complex feelings. In addition, interpreting human expression can involve body language, gaze, or spatial orientation. Second, expression is shaped by cultural and physical circumstances. For human-robot interactions to capture this accurately, a variety of conditions must be considered when inferring and modelling emotion.

“Emotion recognition is not what I aim to do. Because our datasets are labelled by other people, we don’t know if that’s how they actually feel.”

Lim says that this is where the distinction between emotion recognition and social signal recognition (i.e., expression recognition) comes in. The latter locates emotion through expressions or common emotional ‘codes.’ These codes are social signals that can be recognised; however, Lim notes that this should come with an understanding that one’s true emotion will never be fully apprehended.

Social Signals in the Wild: ‘Real World’ Data and Challenges

In 2017, Dr. Angelica Lim helped produce the UE-HRI dataset of spontaneous user-robot interactions. By matching expressions in these interactions, Lim et al. (2021) produced a categorical taxonomy of emotional expressions with robots.

Initial taxonomy of social signals in the UE-HRI Dataset. Ghazal Saheb Jam, Jimin Rhim, and Angelica Lim. 2021. 

In addition to the six prototypical emotion classes, the taxonomy incorporates 28 fine-grained social signal classes, represented by the emojis above. From this, Lim suggests that researchers can use the segmentation and emoji annotation method to locate expressions in their own data, as well as apply the taxonomy to identify expressions that go beyond the six general categories.
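As a sketch of what this segmentation-and-annotation method might produce, the record below pairs a time segment with a coarse emotion class and fine-grained social-signal labels. The field names and example labels are hypothetical, not the actual UE-HRI annotation schema.

```python
# Hypothetical shape of an annotated segment: a time span of an interaction
# labelled with one coarse emotion class plus fine-grained social signals.
# Field names and label values are illustrative, not the UE-HRI schema.
from dataclasses import dataclass, field

@dataclass
class AnnotatedSegment:
    video_id: str
    start_s: float
    end_s: float
    emotion_class: str                    # one of the six prototypical classes
    social_signals: list[str] = field(default_factory=list)  # fine-grained labels

segment = AnnotatedSegment(
    video_id="session_042",            # hypothetical identifier
    start_s=12.4,
    end_s=15.1,
    emotion_class="surprise",
    social_signals=["skeptical", "raised_eyebrow"],  # hypothetical labels
)
print(segment)
```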

One interesting finding Lim points out is the relatively high number of skeptical expressions in the dataset, which might speak to the current interface milieu. As for limitations, Lim noted that there are still gaps between how an expression is made and the emotion inferred from it. Further, the UE-HRI dataset was produced primarily from interactions with French-language students in France; future work might account for linguistic and cultural variations in this range of expressions.

Addressing Challenges of Culture in MML for HRI

As mentioned earlier, classifiers commonly used in multimodal expression recognition (e.g., facial emotion, vocal emotion, and speech sentiment) are highly dependent on the context of a given interaction, and such classifiers are weighted differently across cultural groups. As Dr. Angelica Lim puts it, there shouldn’t be a “one-size-fits-all facial emotion recognition algorithm”, yet many automatic emotion recognition systems still train on data originating from North America, often with majority-Caucasian training samples. To address this, Lim has explored some of the cultural challenges in machine learning and robotics at the ROSIE Lab.

Towards Inclusive Human Robot Interactions

Dr. Angelica Lim’s presentation drew on numerous examples to illustrate how emotional expression varies on cultural grounds. The figure below, from a study conducted by the ROSIE Lab, highlights the expressions of ethnic minority groups, as characterized through localized facial actions.

AU activation map that shows the localization of activation using a threshold of 2.5.

*Movie/TV shows as primary source; **Mix of Movie/TV shows, Reality TV and Vlogs as primary source; ***Reality TV and Vlogs as primary source. Emma Hughson, Roya Javadi, James Thompson, and Angelica Lim. 2021.

This approach addresses concerns around the representation of ethnic minority groups in emotion expression recognition. Training a DNN with simulated data, with its early layers unfrozen, showed a 24% improvement in expression recognition for non-Caucasian groups and a 10% improvement for Caucasian samples.
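As a rough sketch of what fine-tuning with unfrozen early layers can look like, the snippet below adapts the early blocks of a pretrained image backbone while attaching a new expression-classification head. The ResNet-18 backbone, the layer split, and the hyperparameters are assumptions for illustration, not the setup reported in Hughson et al. (2021).

```python
# Minimal sketch of fine-tuning with the *early* layers unfrozen. The backbone
# (ResNet-18), layer split, and class count are illustrative assumptions, not
# the architecture reported in the study.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # hypothetical number of expression classes

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained backbone (downloads on first use)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new expression head

# Freeze everything by default ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze the early feature extractors (plus the new head), so the
# low-level filters can adapt to the new training data.
for module in (model.conv1, model.bn1, model.layer1, model.fc):
    for param in module.parameters():
        param.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised update on a batch of face crops and expression labels."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```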

The technologies that Lim works with can also serve as tools for cultural revitalization initiatives. This is especially pertinent amid ongoing discussions around the history of residential schools and other cultural assimilation policies in Canada, which suppressed the cultural practices and traditional languages of Indigenous groups. The Blackfoot Revitalisation Project, which Lim co-leads, addresses this by approaching AI, ML, and robotics as language acquisition tools.

As Dr. Lim’s research suggests, artificial intelligence gives robot agents more realistic modes of communication and a greater sense of autonomy. Of course, that comes with numerous caveats. The data these systems are trained on shapes human-robot interactions and can perpetuate injustice. Robots also aren’t as smart as we might think; as these technologies develop, misrepresentations of their capabilities can grow. If the intentions and design of robots are not transparent, that can undermine trust in turn. Hence, as these tools grow and become more integrated into our lives, addressing bias should be an ongoing endeavour for the actors and agents involved in their development.

Dr. Lim puts it another way: addressing equity, diversity, and inclusion just makes these systems better. If the goal is to design useful robots that interact naturally with humans, then they ought to be trained with the diversity of humans in mind.