Leaderboard for Task 1 (TBD)
Rank | Team | Score |
1 | ? | ? |

Leaderboard for Task 2 (TBD)
Rank | Team | Score |
1 | ? | ? |
The Responsible Multimodal AI Challenge aims to foster advancements in the development of reliable and trustworthy multimodal AI systems by addressing two crucial tasks: i) multimodal hallucination detection and ii) multimodal factuality detection. These tasks are designed to highlight the challenges of, and encourage innovative solutions for, mitigating critical risks associated with generative multimodal AI. Task 1, Multimodal Hallucination Detection, focuses on identifying hallucinated content in AI-generated captions for images. Participants will analyze captions to detect objects, attributes, or relationships that are fabricated or unsupported by the visual input. Task 2, Multimodal Factuality Detection, emphasizes verifying the factuality of textual claims using both visual and contextual textual information. Participants will assess the factuality of claims in real-world scenarios. By addressing these tasks, the challenge seeks to promote the development of robust evaluation methodologies and algorithms that mitigate risks such as misinformation, bias, and errors in multimodal systems.
The challenge contains two subtasks.
Task 1: Multimodal Hallucination Detection.
The goal of this task is to detect hallucinated information in captions generated by multimodal AI systems. Participants are provided with an image and a caption generated by an AI system describing the image. Along with this, a predefined list of options is provided, where each option represents an object, attribute, or relationship mentioned in the caption. Participants must analyze the generated caption to determine which options in the list are hallucinated, meaning they are not supported or justified by the content of the given image. These hallucinations might include objects that do not appear in the image, attributes that do not match the visual characteristics, or relationships between objects that are incorrectly described. The task is treated as a multi-label classification problem, where multiple options in the list can simultaneously be hallucinations or non-hallucinations.
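To make the input/output structure concrete, a single Task 1 instance might look roughly like the sketch below; the field names and values are illustrative assumptions, not the official data schema.

```python
# Illustrative Task 1 instance; field names and values are assumptions, not the released schema.
task1_instance = {
    "image": "images/0001.jpg",  # input image
    "caption": "A red car is parked under a tree while a dog sleeps on its roof.",
    "options": [
        "red car",                 # object + attribute mentioned in the caption
        "tree",                    # object mentioned in the caption
        "dog sleeps on the roof",  # relationship mentioned in the caption
    ],
    # Multi-label target: one flag per option, 1 = hallucinated, 0 = supported by the image.
    "labels": [0, 0, 1],
}
```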
Task 2: Multimodal Factuality Detection.
This task focuses on verifying the factual accuracy of claims by analyzing multimodal inputs. Participants are given:
• A claim in textual form, which can be a news headline, a sentence from an article, or a social media post.
• An accompanying image related to the claim.
• Additional context in textual form, such as the full text of the news article, a related social media discussion, or other supplementary information.
Participants must determine the factual accuracy of the claim based on all the provided inputs. They need to assign one of four possible labels to the claim: "True", "False", "Partially True", or "Not Verifiable". This task is treated as a four-class classification problem, requiring participants to consider both the visual and textual evidence to assess the claim's factuality comprehensively.
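Analogously, a Task 2 instance could be represented as below; again, the field names and the example content are purely illustrative assumptions rather than the released format.

```python
# Illustrative Task 2 instance; field names and content are assumptions, not the released schema.
FACTUALITY_LABELS = ["True", "False", "Partially True", "Not Verifiable"]

task2_instance = {
    "claim": "City officials announced the bridge reopened to traffic yesterday.",
    "image": "images/bridge_reopening.jpg",  # accompanying image related to the claim
    "context": "Full text of the news article or a related social media discussion.",
    "label": "Partially True",               # one of the four factuality classes above
}
```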
The F1 score is computed using precision (P) and recall (R), which are calculated as follows:
P = TP / ( TP + FP )
R = TP / ( TP + FN )
F1 = 2 * P * R / ( P + R )
where TP, FP, and FN denote the true positives, false positives, and false negatives of the confusion matrix. In particular, when computing the micro-F1 score, TP corresponds to the number of predicted tuples that match exactly with those in the gold set.
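A minimal sketch of this micro-F1 computation, assuming predictions and gold annotations are compared as exact-match tuples pooled over the whole test set:

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over all predicted vs. gold items, with exact matches counted as TP."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)   # predicted tuples that exactly match a gold tuple
    fp = len(pred_set - gold_set)   # predicted tuples with no exact match in the gold set
    fn = len(gold_set - pred_set)   # gold tuples the system failed to predict
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical example: gold hallucinated options vs. a system's predictions.
gold = [("img_0001", "dog sleeps on the roof")]
pred = [("img_0001", "dog sleeps on the roof"), ("img_0001", "tree")]
print(micro_f1(pred, gold))  # P = 0.5, R = 1.0, so F1 ≈ 0.667
```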
We will use synthetic and real-world multimodal datasets for these tasks, with balanced distributions of various scenarios. The datasets cover diverse scenarios such as indoor, outdoor, social, and news contexts to ensure robust evaluation. Tables I and II report the dataset statistics.

Table I. Dataset statistics for Task 1.
Table II. Dataset statistics for Task 2.
To participate in the challenge, please first register by submitting the form.
Please submit your models and predicted results (as a JSON file named "results.json"), zipped into a single file. Participants should submit by emailing the file to kai.lk@u.nus.edu. We will review the submissions and publish the ranking here.
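The exact layout of "results.json" is not specified on this page; the sketch below only illustrates one plausible way to write predictions, with assumed field names and IDs, and should be adapted to whatever format is released with the data.

```python
import json

# Hypothetical predictions; field names and IDs are assumptions, not an official format.
predictions = [
    {"id": "task1_0001", "hallucinated_options": [2]},  # Task 1: indices of hallucinated options
    {"id": "task2_0001", "label": "Partially True"},     # Task 2: one of the four factuality labels
]

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, indent=2)
```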
Please note: the submission deadline is 11:59 p.m. (Anywhere on Earth) on the stated deadline date.
Registration Opens | February 01, 2025 |
Training Data Release | February 28, 2025 |
Challenge Result Submission Deadline | April 1, 2025 |
Leaderboard Release | April 10, 2025 |
Challenge Paper Submission Deadline | April 30, 2025 |
Kai Liu (National University of Singapore, Singapore)
Xudong Han (University of Sussex, UK)
Yanlin Li (National University of Singapore, Singapore)
Hao Li (Wuhan University, China)
Zheng Wang (Wuhan University, China)