Learning Local Representations of Images and Text
Abstract:
Images and text inherently exhibit hierarchical structures, e.g., scenes built from objects and sentences built from words. In many computer vision and natural language processing tasks, learning accurate prediction models requires analyzing the correlations between the local primitives of the input and output data. In this proposal, we aim to develop techniques for learning local representations of images and text and demonstrate their effectiveness on visual recognition, retrieval, and synthesis. In particular, the proposal includes three primary research projects: (1) Text2Scene, a sequence-to-sequence image synthesis framework that produces the scene depicted in a textual description by sequentially predicting objects, their locations, and their attributes such as size and aspect ratio; (2) DrillDown, an interactive image retrieval model that encodes multiple rounds of natural language queries with a region-aware state representation; (3) a newly proposed project that explores the task of instance-level image recognition and retrieval. The key ingredient of this last project is a transformer-based model that learns the visual similarity of an image pair by incorporating both the global and local features of the images.
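To illustrate the third project's architecture at a high level, the sketch below shows one way a transformer could jointly attend over the global and local features of an image pair to produce a similarity score. This is a minimal illustrative sketch, not the proposed model: the class, feature dimensions, pooling choice, and scoring head are all assumptions, and the local features are assumed to be precomputed region descriptors.

```python
# A minimal sketch (not the proposed model) of scoring image-pair similarity
# with a transformer over global and local features. All module and tensor
# names are hypothetical, and the inputs are assumed to be precomputed CNN
# features: one global descriptor per image plus N local region descriptors.
import torch
import torch.nn as nn


class PairSimilarityTransformer(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Learned embeddings marking which of the two images a token comes from.
        self.image_id_embed = nn.Embedding(2, feat_dim)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, global_a, locals_a, global_b, locals_b):
        # global_*: (B, D) global descriptors; locals_*: (B, N, D) region descriptors.
        tokens_a = torch.cat([global_a.unsqueeze(1), locals_a], dim=1)
        tokens_b = torch.cat([global_b.unsqueeze(1), locals_b], dim=1)
        tokens_a = tokens_a + self.image_id_embed.weight[0]
        tokens_b = tokens_b + self.image_id_embed.weight[1]
        # Joint self-attention over both images' global and local tokens.
        encoded = self.encoder(torch.cat([tokens_a, tokens_b], dim=1))
        # Mean-pool the encoded tokens and map to a scalar similarity score.
        return self.score_head(encoded.mean(dim=1)).squeeze(-1)


if __name__ == "__main__":
    model = PairSimilarityTransformer()
    g_a, g_b = torch.randn(4, 512), torch.randn(4, 512)
    l_a, l_b = torch.randn(4, 10, 512), torch.randn(4, 10, 512)
    print(model(g_a, l_a, g_b, l_b).shape)  # torch.Size([4])
```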
Committee Members:
- Yanjun Qi, Committee Chair (Department of Computer Science)
- Vicente Ordonez, Advisor (Department of Computer Science)
- Yangfeng Ji (Department of Computer Science)
- Mona Kasra (Department of Drama)
- Ming-Hsuan Yang (UC Merced, Google Research)
- Connelly Barnes (Adobe Research)