Worgan, Simon F.
Modelling the emergence of a basis for vocal communication between artificial agents
University of Southampton, School of Electronics and Computer Science
Understanding the human faculty for speech presents a fundamental and complex problem. We do not know how humans decode the rapid speech signal, and the origins and evolution of speech remain shrouded in mystery. Speakers generate a continuous stream of sounds apparently devoid of any specifying invariant features. Despite this absence, we can effortlessly decode this stream and comprehend the utterances of others. Moreover, the form of these utterances is shared and mutually understood by a large population of speakers. In this thesis, we present a multi-agent model that simulates the emergence of a system with shared auditory features and articulatory tokens. Based upon notions of intentionality and the absence of specifying invariants, each agent produces and perceives speech, learning to control an articulatory model of the vocal tract and perceiving the resulting signal through a biologically plausible artificial auditory system. By firmly establishing each aspect of our model in current phonetic theory, we are able to make useful claims and justify our inevitable abstractions. For example, Lindblom’s theory of hyper- and hypo-articulation, in which speakers seek maximum auditory distinction for minimal articulatory effort, justifies our choice of an articulatory vocal tract coupled with a direct measure of effort. By removing the abstractions of previous phonetic models, we have been able to reconsider the current assumption that specifying invariants, in either the auditory or articulatory domain, must indicate the presence of auditory or articulatory symbolic tokens in the cognitive domain. Rather, we consider speech perception to proceed through Gibsonian direct realism, in which the speaker manipulates the signal to enable the perception of the affordances within speech.
We conclude that the speech signal is constrained by the intention of the speaker and the structure of the vocal tract, and is decoded through an interaction between the peripheral auditory system and complex pattern recognition of multiple acoustic cues. Far from passive ‘variance mopping’, this recognition proceeds through the constant refinement of an unbroken loop between production and perception.