University of Southampton Institutional Repository

Learning representations of visual semantics for out-of-distribution generalization

University of Southampton
Jiao, Yue (e0dec959-0981-44c7-a6f9-5c8f4c28465a)
Hare, Jonathon (65ba2cda-eaaf-4767-a325-cd845504e5a9)

Jiao, Yue (2023) Learning representations of visual semantics for out-of-distribution generalization. University of Southampton, Doctoral Thesis, 117pp.

Record type: Thesis (Doctoral)

Abstract

Building systems that can understand and represent visual semantic knowledge is one of the fundamental problems on the path towards artificial general intelligence. Much of the previous work on visual semantic understanding has used visual semantic embedding (VSE) to align well-represented visual features with language features. However, these techniques are insufficient to generalize beyond the data seen during training. In this thesis, we first revisit the hierarchy of levels between raw media and full semantics. We then propose the hypothesis that learning multimodal knowledge representations which can be recomposed dynamically as needed is a path towards out-of-distribution (OOD) visual semantic understanding. The first focus of this thesis is the behaviour of current VSE systems. We develop a variety of probing techniques to answer the question: what kind of semantic information from unimodal pre-training is learnt by VSE? We show that static relational information from large text corpora and expert-curated knowledge bases does not persist in the semantic space of VSE models. Moreover, when VSE models learn contextual information with frozen language models, they lack a mutual exclusivity bias, which limits their performance on OOD recognition. The second focus of this thesis is to understand the capabilities of pre-trained multimodal models. We try to answer the question: do multimodal models generalize systematically, and to what extent do they understand and produce new compositions from known concepts? Across many experiments, we demonstrate that the language encoder in a pre-trained multimodal model plays an important role both in producing concept compositions and in enhancing representations of unfamiliar visual concepts. Drawing on these two investigations, the final chapter of the thesis confirms that multimodal pre-training plays a core role in OOD semantic understanding. Future research on learning visual semantics for OOD generalization should first develop probing tools to explore how visual concepts emerge in pre-trained multimodal models.
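
The thesis's own models are not reproduced here, but the alignment objective that typical VSE systems optimise is easy to sketch. The following is a minimal, hypothetical example of a CLIP-style symmetric contrastive loss that pulls paired image and text features together in a shared embedding space; all names, dimensions, and the temperature value are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch (assumed, not the thesis's code) of a CLIP-style
# visual semantic embedding objective: a symmetric contrastive loss
# that aligns paired image and text features in a shared space.
import torch
import torch.nn.functional as F

def vse_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text features."""
    # L2-normalise both modalities so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_feats @ text_feats.t() / temperature

    # Matched image/text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random features standing in for encoder outputs, batch of 8, dim 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(vse_contrastive_loss(img, txt).item())
```

A loss of this shape only teaches the model which captions match which images in-distribution; the abstract's point is that such alignment alone does not guarantee OOD generalization.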

Text: Yue_Jiao_Doctoral_Thesis - Version of Record
Available under License University of Southampton Thesis Licence.
Download (10MB)

Text: Y Jiao Permission to deposit thesis - form
Restricted to Repository staff only
Available under License University of Southampton Thesis Licence.

More information

Submitted date: November 2021
Published date: January 2023

Identifiers

Local EPrints ID: 473733
URI: http://eprints.soton.ac.uk/id/eprint/473733
PURE UUID: 6759d086-937b-4ba7-b26c-97173d205c93
ORCID for Jonathon Hare: orcid.org/0000-0003-2921-4283

Catalogue record

Date deposited: 30 Jan 2023 19:51
Last modified: 17 Mar 2024 03:05

Contributors

Author: Yue Jiao
Thesis advisor: Jonathon Hare

Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2
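
For programmatic access, this record can in principle be harvested from that endpoint with a standard OAI-PMH GetRecord request. Below is a minimal sketch using only the Python standard library; the identifier oai:eprints.soton.ac.uk:473733 follows the usual EPrints OAI naming convention and is an assumption rather than something stated on this page.

```python
# Hypothetical sketch: fetch this record's Dublin Core metadata via OAI-PMH.
import urllib.parse
import urllib.request

BASE = "http://eprints.soton.ac.uk/cgi/oai2"
params = {
    "verb": "GetRecord",
    "metadataPrefix": "oai_dc",  # unqualified Dublin Core
    # Assumed EPrints-style OAI identifier for eprint 473733:
    "identifier": "oai:eprints.soton.ac.uk:473733",
}

with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as r:
    xml = r.read().decode("utf-8")

print(xml[:500])  # print the start of the returned OAI-PMH XML envelope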

This repository has been built using EPrints software, developed at the University of Southampton and freely available for everyone to use.
