Social Robots for Multimodal Interactions

About this research line

We investigate how social robots can engage in rich, multimodal interactions, combining speech, vision, gestures and touch, to have robots functioning naturally in real-world environments.

In social robots, buttons and screen interfaces are replaced by verbal and non-verbal communication. All robots in science fiction are social robots; they are all able to understand human actions and engage with people. This is in stark contrast to most robots we see today, which remain separated from us. Industrial robots are kept apart from people, and from the safety of a cage, they weld cars or fill boxes, never being aware of the richness of human activity around them. Research in social robotics aims to change that, but creating social robots is a formidable challenge.

Social interaction is possibly one of the biggest challenges in artificial intelligence and robotics. As social interaction uses all faculties of the human brain, such as memory, language, semantics, and emotion, we need to create artificial equivalents of all these. The latest developments in AI and machine learning are integrated to build robots that understand and change the social world. Our goal is to build and integrate AI into physical robots to create robots that can integrate into our human-inhabited environments.

Next to the technical aspects of building social robots, we are also interested in how we interact with robots. This study of Human-Robot Interaction uses insights and methods from related scientific fields, such as social psychology and design, to uncover how we respond to social robots. What aspects of the design of a robot help us to trust the robot? How persuasive is a robot? Can a robot help you cope with problems?

Designing robots that blend multiple input and output modalities presents significant challenges. Our research focuses on developing algorithms that enable robots to process and generate coordinated responses across different interaction channels. In addition, we use multidimensional measurements (including subjective and physiological measurements) to explore how humans perceived and response to these multimodal cues, ensuring that robots not only communicate effectively but also build trust and rapport with their users

Multimodal Input

Multimodal dialogue systems enable robots to engage in conversations that go beyond speech; for instance, incorporating visual information and user metadata to enhance communication.

Vision language models allow robots to understand and generate responses based on visual cues, such as recognising objects and complex situations without extensive modelling. This enables robots to initiate meaningful interactions, ask follow-up questions or provide relevant information based on the visual context. For example, if a user is wearing a bright red shirt, the robot could comment on it or use it as a reference point in the conversation.

In addition to visual cues, robots can also use user metadata, such as past interactions, preferences, and emotional states, to tailor their responses. Being aware of users’ preferences allows robots to align their behaviour with human values, ethics, and cultural norms. Value-aware robots must recognise and adapt to the moral and social expectations of their users, ensuring their actions are not only functional but also socially appropriate. We are developing frameworks that allow robots to reason and act according to human values, enabling them to make decisions that are sensitive to context and culture.

Multimodal input can improve the contextual understanding of robots in everyday environments.

Multimodal input can improve the contextual understanding of robots in everyday environments.

Gestures and Touch Interactions

Gesture and touch are powerful forms of nonverbal communication that are often combined with speech to convey emotion, intention, and empathy. For instance, when a robot employs a gentle touch and expressive gestures alongside speech, these cues can enhance the clarity, emotional resonance, and social salience of its communication, thereby attracting greater human attention and engagement. By integrating insights from human–robot interaction, cognitive science, and neuroscience, we seek to develop design principles for socially robots that can communicate attentiveness and empathy in a context-sensitive manner. We aim to develop human-centered social robots that can use these non-verbal cues to enhance emotional connection in human-robot interaction.

We explore different touch interactions with the Pepper robot.

We explore different touch interactions with the Pepper robot.

Long-Term Interactions in Care Homes

Social robots have the potential to revolutionise long-term care by providing consistent, personalised support to people touched by loneliness and social isolation, such as residents in care homes. These robots can act as companions, offering emotional support, engaging in conversations, and assisting with daily activities. Over time, they can build meaningful relationships with residents, helping to combat loneliness and improve overall well-being.

Pepper robot in a care home.

Pepper robot in a care home.

Active researchers

Related publications

Integrating visual context into language models for situated social conversation starters

Ruben Janssens, Pieter Wolfert, Thomas Demeester, Tony Belpaeme
In IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 2025
BIBLIO
Abstract
Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters, compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, training on human data versus synthetic questions generated by GPT-3.5. Models were used to generate friendly conversation starters which were evaluated on criteria including language fluency, visual grounding, interestingness, politeness. Results show that GPT-3.5 generates more interesting, polite questions than smaller models that are fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features, they can mimick GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.

Designing social robots with LLMs for engaging human interaction

Maria Jose Pinto Bernal, Matthijs Biondina, Tony Belpaeme
In APPLIED SCIENCES-BASEL 2025
BIBLIO
Abstract
Large Language Models (LLMs), particularly those enhanced through Reinforcement Learning from Human Feedback, such as ChatGPT, have opened up new possibilities for natural and open-ended spoken interaction in social robotics. However, these models are not inherently designed for embodied, multimodal contexts. This paper presents a user-centred approach to integrating an LLM into a humanoid robot, designed to engage in fluid, context-aware conversation with socially isolated older adults. We describe our system architecture, which combines real-time speech processing, layered memory summarisation, persona conditioning, and multilingual voice adaptation to support personalised, socially appropriate interactions. Through iterative development and evaluation, including in-home exploratory trials with older adults (n = 7) and a preliminary study with young adults (n = 43), we investigated the technical and experiential challenges of deploying LLMs in real-world human-robot dialogue. Our findings show that memory continuity, adaptive turn-taking, and culturally attuned voice design enhance user perceptions of trust, naturalness, and social presence. We also identify persistent limitations related to response latency, hallucinations, and expectation management. This work contributes design insights and architectural strategies for future LLM-integrated robots that aim to support meaningful, emotionally resonant companionship in socially assistive settings.

I was blind but now I see : implementing vision-enabled dialogue in social robots

Giulio Antonio Abbo, Tony Belpaeme
In 2025 20TH ACM/IEEE INTERNATIONAL CONFERENCE ON HUMAN-ROBOT INTERACTION, HRI 2025
BIBLIO
Abstract
In the rapidly evolving landscape of human-robot interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents a ready-to-use implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4o mini) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, en-sures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. The system can be customised and is available as a stand-alone application, a Furhat robot implementation, and a ROS2 package.
See more