University of Southampton Institutional Repository

ImagenWorld: stress-testing image generation models with explainable human evaluation on open-ended real-world tasks


Sani, Samin Mahdizadeh, Ku, Max and Jamali, Nima, et al. (2026) ImagenWorld: stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. In The Fourteenth International Conference on Learning Representations. 31 pp.

Record type: Conference or Workshop Item (Paper)

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more on editing tasks than on generation tasks, especially local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies of up to 0.79, approximating human rankings, but fall short of fine-grained, explainable error attribution. ImagenWorld thus provides both a rigorous benchmark and a diagnostic tool for advancing robust image generation.
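To make two ideas in the abstract concrete, the sketch below illustrates (a) what a localized object- or segment-level error annotation might look like and (b) how a Kendall-style agreement score between a human ranking and a VLM-based metric's ranking can be computed. This is a minimal Python sketch, not the authors' released code: the annotation field names and the exact "Kendall accuracy" formula (assumed here to be the fraction of concordant model pairs, skipping ties) are illustrative assumptions.

```python
# Illustrative sketch only: field names and the accuracy formula are
# assumptions, not the ImagenWorld release.
from itertools import combinations

# Hypothetical fine-grained annotation: one tagged error on one output image.
annotation = {
    "task": "single-reference editing",   # one of the six core tasks
    "domain": "screenshots",              # one of the six topical domains
    "error_level": "object",              # "object" or "segment"
    "bbox": [120, 48, 310, 96],           # pixel region the tag localizes
    "error_type": "text_rendering",       # e.g. garbled UI text
    "note": "button label is illegible",
}

def kendall_accuracy(human_scores, metric_scores):
    """Fraction of model pairs that humans and the metric order the same way
    (concordant pairs / comparable pairs); pairs tied in either ranking are
    skipped."""
    concordant, comparable = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        dh = human_scores[i] - human_scores[j]
        dm = metric_scores[i] - metric_scores[j]
        if dh == 0 or dm == 0:
            continue  # tied pair: not comparable
        comparable += 1
        if (dh > 0) == (dm > 0):
            concordant += 1
    return concordant / comparable if comparable else float("nan")

# Toy example: scores for five models from humans vs. a VLM-based metric.
human = [4.2, 3.1, 3.9, 2.5, 4.8]
vlm   = [4.0, 3.3, 3.0, 2.9, 4.6]
print(f"Kendall accuracy: {kendall_accuracy(human, vlm):.2f}")  # 0.90 here
```

Under this pairwise reading, the reported 0.79 would mean the best VLM-based metric orders roughly four in five model pairs the same way human annotators do.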

This record has no associated files available for download.

More information

Published date: 28 February 2026
Venue - Dates: ICLR 2026: The Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 2026-04-23 - 2026-04-27

Identifiers

Local EPrints ID: 510352
URI: http://eprints.soton.ac.uk/id/eprint/510352
PURE UUID: 20f2de4e-ba82-4032-aeae-f5bc08917e63
ORCID for Edisy Kin Wai Chan: orcid.org/0009-0005-7598-5283

Catalogue record

Date deposited: 27 Mar 2026 17:31
Last modified: 28 Mar 2026 03:16

Contributors

Author: Samin Mahdizadeh Sani
Author: Max Ku
Author: Nima Jamali
Author: Matina Mahdizadeh Sani
Author: Paria Khoshtab
Author: Wei-Chieh Sun
Author: Parnian Fazel
Author: Zhi Rui Tam
Author: Thomas Chong
Author: Edisy Kin Wai Chan
Author: Donald Wai Tong Tsang
Author: Chiao-Wei Hsu
Author: Ting Wai Lam
Author: Ho Yin Sam Ng
Author: Chiafeng Chu
Author: Chak-Wing Mak
Author: Keming Wu
Author: Hiu Tung Wong
Author: Yik Chun Ho
Author: Chi Ruan
Author: Zhuofeng Li
Author: I-Sheng Fang
Author: Shih-Ying Yeh
Author: Ho Kei Cheng
Author: Ping Nie
Author: Wenhu Chen
Corporate Author: et al.

