Responsible AI challenge @ ACMMM 25: Multimodal Hallucination Detection and Fact Checking

Leaderboard Task-1


Rank Team Score
1 Evalthon 0.982

Leaderboard Task-2


Rank Team Score
1 Evalthon 0.984

Introduction

The Responsible Multimodal AI Challenge aims to foster the development of reliable and trustworthy multimodal AI systems by addressing two crucial tasks: i) multimodal hallucination detection and ii) multimodal factuality detection. These tasks are designed to highlight the challenges of, and encourage innovative solutions for, mitigating critical risks associated with generative multimodal AI.

Task 1, Multimodal Hallucination Detection, focuses on identifying hallucinated content in AI-generated captions for images. Participants analyze captions to detect objects, attributes, or relationships that are fabricated or unsupported by the visual input. Task 2, Multimodal Factuality Detection, emphasizes verifying the factuality of textual claims using both visual and contextual textual information. Participants assess the factuality of claims in real-world scenarios. By addressing these tasks, the challenge seeks to promote robust evaluation methodologies and algorithms that mitigate risks such as misinformation, bias, and errors in multimodal systems.

Challenge Tasks

The challenge contains two subtasks.

Task-1: Multimodal Hallucination Detection

The goal of this task is to detect hallucinated information in captions generated by multimodal AI systems. Participants are provided with an image and an AI-generated caption describing it, along with a predefined list of options, where each option represents an object, attribute, or relationship mentioned in the caption. Participants must analyze the caption to determine which options are hallucinated, meaning they are not supported or justified by the content of the given image. Such hallucinations may include objects that do not appear in the image, attributes that do not match the visual characteristics, or relationships between objects that are incorrectly described. The task is treated as a multi-label classification problem: multiple options in the list can be hallucinated simultaneously.

For this task, we will provide a dataset with a train set, a dev set, and a test set. The annotations are in the form of a JSON file. An example of the task metadata is shown below.

"metadata": {
        "image_id": "00385794700c832e.jpg",
        "system1_question": "What type of shelf is the produce arranged on in the image?",
        "system1_answer": "A produce shelf displays fresh, vibrant vegetables, including bunches of cilantro and other leafy greens placed neatly above a bed of carrots. The greenery is arranged on a blue wire shelf, and the overall setup showcases an organized and aesthetic presentation of fresh produce.",
        "system2_question": "Which of the following systems' outputs belong to hallucination?",
        "correct_answer": "blue wire shelf",
        "correct_choice": "C",
        "choices": [
            {
                "id": "A",
                "choice": "No hallucination"
            },
            {
                "id": "B",
                "choice": "leafy greens"
            },
            {
                "id": "C",
                "choice": "blue wire shelf"
            },
            {
                "id": "D",
                "choice": "fresh produce"
            }
        ]
    }
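As a sketch of how this annotation format can be consumed (the field names below are taken directly from the example above), one might parse each record into a list of (choice id, choice text, hallucinated?) tuples, using the gold `correct_choice` field as the label:

```python
# Minimal sketch of parsing a Task-1 annotation record.
# Field names follow the metadata example above; the record is
# abbreviated to the fields needed for label extraction.
record = {
    "metadata": {
        "image_id": "00385794700c832e.jpg",
        "correct_choice": "C",
        "choices": [
            {"id": "A", "choice": "No hallucination"},
            {"id": "B", "choice": "leafy greens"},
            {"id": "C", "choice": "blue wire shelf"},
            {"id": "D", "choice": "fresh produce"},
        ],
    }
}

def parse_choices(record):
    """Return (id, text, is_hallucinated) for each option in a record,
    treating the gold 'correct_choice' id as the hallucinated option."""
    meta = record["metadata"]
    gold = meta["correct_choice"]
    return [(c["id"], c["choice"], c["id"] == gold) for c in meta["choices"]]

for cid, text, hallucinated in parse_choices(record):
    print(cid, text, hallucinated)
```

Because the task is multi-label, a full parser would collect every annotated hallucinated id per record rather than a single one; the sketch above handles the single-gold case shown in the example.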

Task-2: Multimodal Fact Checking

This task focuses on verifying the factual accuracy of claims by analyzing multimodal inputs. Participants are given:
• A claim in textual form, which can be a news headline, a sentence from an article, or a social media post.
• An accompanying image related to the claim.
• Additional context in textual form, such as the full text of the news article, a related social media discussion, or other supplementary information.
Participants must determine the factual accuracy of the claim based on all the provided inputs, assigning one of four labels: "True", "False", "Partially True", or "Not Verifiable". This task is treated as a four-class classification problem, requiring participants to weigh both the visual and textual evidence to assess the claim's factuality comprehensively.

For this task, we will provide a dataset with a train set, a dev set, and a test set. The annotations are in the form of a JSON file. An example of the task metadata is shown below.

"metadata": {
        "image_id": "0234613.jpg",
        "claim": "Corbin Aoyagi a supporter of gay marriage waves a rainbow flag during a rally at the Utah State Capitol on Jan 28",
        "context": "The attorneys general of Virginia and Utah are bringing their state same-sex marriage bans to the Supreme Court. Utah Attorney General Sean Reyes filed a petition with the court, seeking a review of the ruling that struck down Utah's ban on same-sex marriage. Virginia Attorney General Mark Herring plans to do the same, arguing the ban is discriminatory. There is a push for a swift resolution due to the numerous legal victories for same-sex marriage advocates following last summer's Supreme Court decision. Utah's petition questions if the 14th Amendment prevents states from defining marriage as only between a man and a woman. Reyes emphasizes his duty to defend the state's constitution. Herring supports a quick final resolution to affirm marriage rights for all Virginians.",
        "question": "Based on the provided information, please determine whether the claim is factual.",
        "correct_answer": "True",
        "correct_choice": "A",
        "choices": [
            {
                "id": "A",
                "choice": "True"
            },
            {
                "id": "B",
                "choice": "False"
            },
            {
                "id": "C",
                "choice": "Partially True"
            },
            {
                "id": "D",
                "choice": "Not Verifiable"
            }
        ]
    }
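If a system predicts label strings rather than choice ids, a small helper can map a predicted label back to its id in a record for reporting. This is only a sketch; the field names are taken from the example above, and the record is abbreviated to the fields the helper needs:

```python
# Abbreviated Task-2 record, with the four-way option list from the
# metadata example above.
record = {
    "metadata": {
        "choices": [
            {"id": "A", "choice": "True"},
            {"id": "B", "choice": "False"},
            {"id": "C", "choice": "Partially True"},
            {"id": "D", "choice": "Not Verifiable"},
        ],
    }
}

def label_to_choice_id(record, label):
    """Map a predicted label string (e.g. 'Partially True') to its
    choice id in this record's option list."""
    for option in record["metadata"]["choices"]:
        if option["choice"] == label:
            return option["id"]
    raise ValueError(f"Unknown label: {label!r}")
```

Looking the id up per record, rather than hard-coding A-D, keeps the mapping correct even if option order varies across records.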

Evaluation

The F1 score is computed using precision (P) and recall (R), which are calculated as follows:

P = TP / ( TP + FP )

R = TP / ( TP + FN )

F1 = 2 * P * R / ( P + R )

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, as defined by the confusion matrix. In particular, when computing the micro-F1 score, TP corresponds to the number of predicted items that exactly match those in the gold set.
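As an illustration of the metric (not the official scorer), micro-F1 over per-example sets of predicted items can be computed by pooling TP, FP, and FN across all examples before applying the formulas above:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over parallel lists of per-example sets.

    gold, pred: lists of sets of items (e.g. hallucinated choice ids),
    one set per example. TP/FP/FN are pooled over all examples.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exact matches
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted, not gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold, not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two examples: the second misses one gold item (FN = 1).
print(micro_f1([{"C"}, {"A", "B"}], [{"C"}, {"A"}]))  # 0.8
```

Here TP = 2, FP = 0, FN = 1, giving P = 1.0, R = 2/3, and F1 = 0.8, consistent with the formulas above.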

Dataset

We will utilize synthetic and real-world multimodal datasets for these tasks, with balanced distributions across scenarios. Tables I and II report the dataset statistics. The datasets cover diverse scenarios such as indoor, outdoor, social, and news contexts to ensure robust evaluation.

Table I. Dataset for Task 1.


Table II. Dataset for Task 2.


Participation and Submission

To participate in the challenge, please first register by submitting the registration form. We will then send you the challenge materials, including the manual, the datasets, and a baseline model.

Please submit your models and predicted results (in a JSON file named "results.json"), zipped into a single archive. Participants should submit by emailing xh218@sussex.ac.uk. We will review the submissions and publish the ranking here.

Timeline

Please note: The submission deadline is at 11:59 p.m. (Anywhere on Earth) of the stated deadline date.

Registration Opens March 20, 2025
Training Data Release March 30, 2025
Challenge Result Submission Deadline May 20, 2025
Leaderboard Release June 1, 2025
Challenge Paper Submission Deadline June 15, 2025

Organizers

Xudong Han (University of Sussex, UK)

Kai Liu (National University of Singapore, Singapore)

Yanlin Li (National University of Singapore, Singapore)

Hao Li (Wuhan University, China)

Zheng Wang (Wuhan University, China)