Apple’s latest research introduces MAD-Bench, a benchmark designed to evaluate the robustness of multi-modal large language models (MLLMs) when dealing with misleading information. The benchmark consists of 850 image-prompt pairs and assesses how well MLLMs handle consistency between text and images, providing valuable reference data for MLLM development. Among the models tested, GPT-4V performed notably better in categories such as scene understanding and visual confusion, offering useful guidance for designing more robust AI models.
The introduction of MAD-Bench marks a new stage in AI model evaluation. Benchmarks of this kind help improve the reliability and anti-interference capability of AI models, and more such benchmarks are likely to follow, promoting the safer and more dependable development of AI technology and bringing greater benefit to society.
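To make the evaluation protocol concrete, here is a minimal sketch of how per-category accuracy over such image-prompt pairs could be computed. This is a hypothetical illustration, not Apple’s released evaluation code: `DeceptivePair`, `ask_model`, and `is_deceived` are assumed names standing in for the dataset format, the model under test, and the response grader.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class DeceptivePair:
    image_path: str   # image shown to the model
    prompt: str       # deliberately misleading question about the image
    category: str     # e.g. "scene_understanding" or "visual_confusion"

def evaluate(pairs: list[DeceptivePair],
             ask_model: Callable[[str, str], str],
             is_deceived: Callable[[str], bool]) -> dict[str, float]:
    """Return per-category accuracy: the fraction of misleading prompts
    the model answers without being fooled."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for pair in pairs:
        answer = ask_model(pair.image_path, pair.prompt)
        total[pair.category] += 1
        if not is_deceived(answer):
            correct[pair.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

In practice, judging whether a response was misled is itself nontrivial; the sketch abstracts that step behind `is_deceived`, which could be implemented with an automated grader (such as another LLM) or human review.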