University of Southampton Institutional Repository

Freestyle layout-to-image synthesis

Xue, Han, Huang, Zhiwu, Sun, Qianru, Song, Li and Zhang, Wenjun (2023) Freestyle layout-to-image synthesis. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 14256-14266. (doi:10.1109/CVPR52729.2023.01370).

Record type: Conference or Workshop Item (Paper)

Abstract

Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., the 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of such models, i.e., how far they can go in generating unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to large-scale pre-trained language-image models, a number of discriminative models (e.g., for image classification and object detection) trained on a limited set of base classes have been empowered to predict unseen classes. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is enabling the diffusion model to synthesize images from a specific layout that very likely violates its pre-learned knowledge, e.g., the model never saw “a unicorn sitting on a bench” during pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into the diffusion model to integrate semantic masks. This “plug-in” is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to constrain each text token to act only on the pixels in its specified region, allowing us to freely place a wide variety of semantics from the pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, and has high potential to enable a range of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet.
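
The rectification described in the abstract reduces to a masking step inside standard cross-attention. The following is a minimal PyTorch sketch of that idea; the tensor shapes, the function name, and the -inf masking strategy are illustrative assumptions rather than the authors' exact implementation, which is available in the linked repository.

# Minimal, illustrative sketch of Rectified Cross-Attention (RCA) as described
# in the abstract: each text token is restricted to act only on the pixels of
# its assigned layout region. Shapes and the -inf masking are assumptions for
# illustration; see https://github.com/essunny310/FreestyleNet for the
# official code.
import torch

def rectified_cross_attention(q, k, v, region_mask):
    """
    q:           [B, HW, d]  image-token queries (HW = flattened spatial size)
    k, v:        [B, T,  d]  text-token keys/values (T text tokens)
    region_mask: [B, HW, T]  1 where text token t may act on pixel p, else 0
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5         # [B, HW, T] attention logits
    # Rectification: forbid each text token from attending outside its
    # layout region by pushing disallowed logits to -inf before the softmax.
    scores = scores.masked_fill(region_mask == 0, float("-inf"))
    attn = scores.softmax(dim=-1)                     # re-normalize over allowed tokens
    attn = torch.nan_to_num(attn)                     # pixels covered by no token -> 0
    return attn @ v                                   # [B, HW, d] rectified output

Pushing disallowed logits to -inf before the softmax guarantees that each text token's attention mass stays entirely within its layout region, which is what allows unseen semantics from the pre-trained model to be placed at specific spatial locations.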

This record has no associated files available for download.

More information

Published date: 22 August 2023
Venue - Dates: Computer Vision and Pattern Recognition (CVPR) 2023, Vancouver, Canada, 2023-06-18

Identifiers

Local EPrints ID: 500970
URI: http://eprints.soton.ac.uk/id/eprint/500970
PURE UUID: a579dea5-6773-4d55-9170-c6721bc6c1ea
ORCID for Zhiwu Huang: orcid.org/0000-0002-7385-079X

Catalogue record

Date deposited: 20 May 2025 16:33
Last modified: 21 May 2025 02:10

Contributors

Author: Han Xue
Author: Zhiwu Huang
Author: Qianru Sun
Author: Li Song
Author: Wenjun Zhang
