
Emergence of language structures from exposure to visually grounded speech signal

Research output: Contribution to conference › Abstract

A variety of computational models can learn the meanings of words and sentences from exposure to word sequences coupled with the perceptual context in which they occur. More recently, neural network models have been applied to more naturalistic and more challenging versions of this problem: for example, phoneme sequences, or raw speech audio accompanied by correlated visual features. In this work we introduce a multi-layer recurrent neural network model which is trained to project spoken sentences and their corresponding visual scene features into a shared semantic space. We then investigate to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized. Our ultimate goal is to trace how auditory signals are progressively refined into meaning representations, and how this process is learned from grounded speech data.
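The joint-embedding setup the abstract describes can be sketched as follows. This is not the authors' code: the use of GRU cells, the layer sizes, and the feature dimensions are all illustrative assumptions. A stacked recurrent encoder maps a spoken sentence (a sequence of acoustic feature frames) into the same semantic space as the visual features of its scene; the intermediate layer activations are what one would probe for emergent word-like structure.

```python
# Minimal sketch of a speech-to-shared-space encoder (assumed architecture).
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, n_audio_feats=13, n_visual_feats=2048,
                 hidden=512, layers=4):
        super().__init__()
        # Stacked recurrent layers over the audio frames; per-layer hidden
        # states are the representations one would analyze for word emergence.
        self.rnn = nn.GRU(n_audio_feats, hidden,
                          num_layers=layers, batch_first=True)
        # Project the top layer's final state into the shared semantic space.
        self.proj = nn.Linear(hidden, n_visual_feats)

    def forward(self, frames):          # frames: (batch, time, n_audio_feats)
        _, h = self.rnn(frames)         # h: (layers, batch, hidden)
        emb = self.proj(h[-1])          # top layer's last hidden state
        return nn.functional.normalize(emb, dim=-1)

# Usage: embed two utterances of 100 frames each and score them against
# (already normalized) visual scene vectors by cosine similarity.
enc = SpeechEncoder()
speech = torch.randn(2, 100, 13)
visual = nn.functional.normalize(torch.randn(2, 2048), dim=-1)
sim = enc(speech) @ visual.t()          # (2, 2) similarity matrix
```

Training would then push each utterance's embedding toward the visual vector of its own scene and away from the others, e.g. with a contrastive margin loss over the rows of `sim`.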
Original language: English
State: Published - 2017
Event: Computational Linguistics in the Netherlands 27 - KU Leuven, Leuven, Belgium
Duration: 10 Feb 2017 → …


Conference: Computational Linguistics in the Netherlands 27
Period: 10/02/17 → …

Research areas

  • speech, language and vision, cross-situational learning, grounding, neural networks, representation learning