Social Robots for Multimodal Interactions

About this research line

We investigate how social robots can engage in rich, multimodal interactions, combining speech, vision, gestures, and touch, so that robots can function naturally in real-world environments.

In social robots, buttons and screen interfaces are replaced by verbal and non-verbal communication. Robots in science fiction are almost invariably social robots: they understand human actions and engage with people. This is in stark contrast to most robots we see today, which remain separated from us. Industrial robots are kept apart from people and, from the safety of a cage, weld cars or fill boxes, never aware of the richness of human activity around them. Research in social robotics aims to change that, but creating social robots is a formidable challenge.

Social interaction is possibly one of the biggest challenges in artificial intelligence and robotics. Because social interaction draws on all faculties of the human brain, such as memory, language, semantics, and emotion, we need to create artificial equivalents of each of these. We integrate the latest developments in AI and machine learning to build robots that understand and act in the social world. Our goal is to embed this AI in physical robots, creating robots that fit into our human-inhabited environments.

Besides the technical aspects of building social robots, we are also interested in how we interact with them. This study of Human-Robot Interaction draws on insights and methods from related scientific fields, such as social psychology and design, to uncover how we respond to social robots. What aspects of a robot's design help us to trust it? How persuasive is a robot? Can a robot help you cope with problems?

Designing robots that blend multiple input and output modalities presents significant challenges. Our research focuses on developing algorithms that enable robots to process and generate coordinated responses across different interaction channels. In addition, we use multidimensional measurements (including subjective and physiological measures) to explore how humans perceive and respond to these multimodal cues, ensuring that robots not only communicate effectively but also build trust and rapport with their users.
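To make this concrete, here is a minimal sketch of what a coordinated multimodal turn can look like in code; the channel names and the planning logic are illustrative assumptions rather than our actual software stack.

```python
from dataclasses import dataclass

@dataclass
class MultimodalAct:
    """One coordinated robot turn spanning several output channels."""
    speech: str                     # what the robot says
    gesture: str | None = None      # named gesture animation, if any
    gaze_target: str | None = None  # where the robot should look
    delay_s: float = 0.0            # gesture onset relative to speech onset

def plan_response(user_speech: str, user_visible: bool) -> MultimodalAct:
    """Fuse verbal and non-verbal channels into one coherent social act."""
    act = MultimodalAct(speech="Nice to see you again!", gesture="wave")
    if user_visible:
        # Align gaze with speech so the turn reads as a single social signal.
        act.gaze_target = "user_face"
    return act
```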

Multimodal Input

Multimodal dialogue systems enable robots to engage in conversations that go beyond speech; for instance, incorporating visual information and user metadata to enhance communication.

Vision language models allow robots to understand and generate responses based on visual cues, such as recognising objects and complex situations without extensive modelling. This enables robots to initiate meaningful interactions, ask follow-up questions or provide relevant information based on the visual context. For example, if a user is wearing a bright red shirt, the robot could comment on it or use it as a reference point in the conversation.
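As a minimal sketch of this idea, the snippet below captions a camera image with an off-the-shelf vision-language model (BLIP, via the Hugging Face transformers library) and turns the caption into a situated remark; the templated comment is an illustrative stand-in for a full dialogue model.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf captioning model; any vision-language model would do here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def visual_conversation_starter(image_path: str) -> str:
    """Describe what the robot's camera sees and turn it into small talk."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)
    # A full system would feed the caption to a dialogue model; here we template it.
    return f"I noticed {caption}. Would you like to tell me more about that?"
```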

In addition to visual cues, robots can also use user metadata, such as past interactions, preferences, and emotional states, to tailor their responses. Being aware of users’ preferences allows robots to align their behaviour with human values, ethics, and cultural norms. Value-aware robots must recognise and adapt to the moral and social expectations of their users, ensuring their actions are not only functional but also socially appropriate. We are developing frameworks that allow robots to reason and act according to human values, enabling them to make decisions that are sensitive to context and culture.
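A minimal sketch of how such metadata could steer a robot's behaviour is shown below; the profile fields and the tailoring rules are hypothetical examples, not a description of our frameworks.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Metadata a robot can accumulate over repeated interactions."""
    name: str
    preferred_topics: list[str] = field(default_factory=list)
    avoided_topics: list[str] = field(default_factory=list)  # e.g. culturally sensitive themes
    formality: str = "informal"                              # "informal" or "formal"

def tailor_greeting(profile: UserProfile) -> str:
    """Choose a greeting that respects the user's preferences and norms."""
    if profile.formality == "formal":
        greeting = f"Good day, {profile.name}."
    else:
        greeting = f"Hi {profile.name}!"
    if profile.preferred_topics:
        greeting += f" Shall we talk about {profile.preferred_topics[0]}?"
    return greeting
```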

Multimodal input can improve the contextual understanding of robots in everyday environments.

Gestures and Touch Interactions

Gesture and touch are powerful forms of nonverbal communication that are often combined with speech to convey emotion, intention, and empathy. For instance, when a robot employs a gentle touch and expressive gestures alongside speech, these cues can enhance the clarity, emotional resonance, and social salience of its communication, thereby attracting greater human attention and engagement. By integrating insights from human-robot interaction, cognitive science, and neuroscience, we seek to develop design principles for social robots that can communicate attentiveness and empathy in a context-sensitive manner. Our aim is human-centred social robots that use these non-verbal cues to deepen the emotional connection in human-robot interaction.
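As an illustration of coordinating speech, gesture, and touch on a robot such as Pepper, the sketch below uses SoftBank's NAOqi SDK; the robot's IP address and the spoken reactions are assumptions, while ALAnimatedSpeech, ALMemory, and the FrontTactilTouched event are standard NAOqi names.

```python
import qi

ROBOT_IP = "192.168.1.10"  # assumption: a Pepper reachable on the local network

session = qi.Session()
session.connect(f"tcp://{ROBOT_IP}:9559")

# Animated speech times a named gesture animation with the spoken phrase.
animated_speech = session.service("ALAnimatedSpeech")
animated_speech.say("^start(animations/Stand/Gestures/Hey_1) Hello, nice to meet you!")

# ALMemory raises an event when the front head tactile sensor is touched.
memory = session.service("ALMemory")

def on_head_touched(value):
    if value:  # 1.0 on press, 0.0 on release
        animated_speech.say("I felt that gentle tap!")

subscriber = memory.subscriber("FrontTactilTouched")
subscriber.signal.connect(on_head_touched)

# A real application would keep the event loop alive, e.g. with qi.Application.
```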

We explore different touch interactions with the Pepper robot.

Long-Term Interactions in Care Homes

Social robots have the potential to revolutionise long-term care by providing consistent, personalised support to people affected by loneliness and social isolation, such as residents in care homes. These robots can act as companions, offering emotional support, engaging in conversations, and assisting with daily activities. Over time, they can build meaningful relationships with residents, helping to combat loneliness and improve overall well-being.

Pepper robot in a care home.

Active researchers

Related publications

Integrating visual context into language models for situated social conversation starters

Ruben Janssens, Pieter Wolfert, Thomas Demeester, Tony Belpaeme
In IEEE Transactions on Affective Computing, 2025
Abstract
Embodied conversational agents that interact socially with people in the physical world require multimodal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. The models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting and polite questions than smaller models fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.

Human-robot interaction : an introduction

Christoph Bartneck, Tony Belpaeme, Friederike Eyssel, Takayuki Kanda, Merel Keijsers, Selma Šabanović
2024
Abstract
The role of robots in society keeps expanding and diversifying, bringing with it a host of issues surrounding the relationship between robots and humans. This introduction to human–robot interaction (HRI) by leading researchers in this developing field is the first to provide a broad overview of the multidisciplinary topics central to modern HRI research. Written for students and researchers from robotics, artificial intelligence, psychology, sociology, and design, it presents the basics of how robots work, how to design them, and how to evaluate their performance. Self-contained chapters discuss a wide range of topics, including speech and language, nonverbal communication, and processing emotions, plus an array of applications and the ethical issues surrounding them. This revised and expanded second edition includes a new chapter on how people perceive robots, coverage of recent developments in robotic hardware, software, and artificial intelligence, and exercises for readers to test their knowledge.

The DREAM dataset : supporting a data-driven study of autism spectrum disorder and robot enhanced therapy

Erik Billing, Tony Belpaeme, Haibin Cai, Hoang-Long Cao, Anamaria Ciocan, Cristina Costescu, Daniel David, Robert Homewood, Daniel Hernandez Garcia, Pablo Gomez Esteban, Honghai Liu, Vipul Nair, Silviu Matu, Alexandre Mazel, Mihaela Selescu, Emmanuel Senft, Serge Thill, Bram Vanderborght, David Vernon, Tom Ziemke
In PLOS ONE, 2020
Abstract
We present a dataset of behavioral data recorded from 61 children diagnosed with Autism Spectrum Disorder (ASD). The data was collected during a large-scale evaluation of Robot Enhanced Therapy (RET). The dataset covers over 3000 therapy sessions and more than 300 hours of therapy. Half of the children interacted with the social robot NAO supervised by a therapist. The other half, constituting a control group, interacted directly with a therapist. Both groups followed the Applied Behavior Analysis (ABA) protocol. Each session was recorded with three RGB cameras and two RGBD (Kinect) cameras, providing detailed information on children's behavior during therapy. This public release of the dataset comprises body motion, head position and orientation, and eye gaze variables, all specified as 3D data in a joint frame of reference. In addition, metadata including participant age, gender, and autism diagnosis (ADOS) variables are included. We release this data with the hope of supporting further data-driven studies towards improved therapy methods as well as a better understanding of ASD in general.