Learning Local Representations of Images and Text
Images and text inherently exhibit hierarchical structure, e.g., scenes built from objects and sentences built from words. In many computer vision and natural language processing tasks, learning accurate prediction models requires analyzing correlations among the local primitives of both the input and output data. In this thesis, we develop techniques for learning local representations of images and text and demonstrate their effectiveness on visual recognition, retrieval, and synthesis. In particular, the thesis includes three primary research projects:
In the first project, we explore the benefits of learning compositional image representations for text-to-image generation. The latest text-to-image generation research is dominated by Generative Adversarial Network (GAN) based methods, which predict pixel-wise intensity values. While demonstrating remarkable results, these methods still have difficulty generating complex scenes with multiple interacting objects. In this work, we propose to model the local structures of images instead of their raw pixel values. We develop a sequence-to-sequence image synthesis framework that produces the scene depicted by a textual description by sequentially predicting objects, their locations, and their attributes such as size and aspect ratio. Compared to previous GAN-based approaches, our method achieves competitive or superior performance while producing more interpretable results.
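The sequential object prediction described above can be sketched as an autoregressive decoding loop. This is a minimal illustration only: the object schema and the toy step function below are assumptions for demonstration, not the thesis's trained model, which would predict each step with a learned sequence-to-sequence network.

```python
from dataclasses import dataclass

@dataclass
class ObjectToken:
    """One step of the scene sequence: an object label plus its
    normalized location (x, y) and attributes (width, height)."""
    label: str
    x: float
    y: float
    w: float
    h: float

def decode_scene(caption_tokens, step_fn, max_objects=10):
    """Autoregressively build a scene layout: each step predicts the
    next object conditioned on the caption and the partial scene."""
    scene = []
    for _ in range(max_objects):
        nxt = step_fn(caption_tokens, scene)
        if nxt is None:  # end-of-scene token predicted
            break
        scene.append(nxt)
    return scene

# Toy stand-in for the learned predictor: emit one object per known
# noun, spacing them left to right across the canvas.
def toy_step(caption, scene):
    nouns = [t for t in caption if t in {"dog", "ball", "tree"}]
    if len(scene) >= len(nouns):
        return None
    i = len(scene)
    return ObjectToken(nouns[i], x=(i + 1) / (len(nouns) + 1),
                       y=0.5, w=0.2, h=0.2)

layout = decode_scene(["a", "dog", "with", "a", "ball"], toy_step)
```

Because the output is a sequence of discrete objects and boxes rather than raw pixels, every prediction in `layout` can be inspected directly, which is the source of the interpretability claimed above.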
In the second project, we show the advantage of learning compositional text representations for interactive image search using multiple rounds of text queries. Cross-modal image search is a well-studied research topic where most recent approaches focus on learning a joint embedding space for visual and textual data. We observe that such a global representation cannot distinguish object instances with similar features. Thus we propose an effective framework that encodes multiple rounds of natural language queries with a region-aware state representation, and show that it outperforms existing sequential encoding and embedding models on both simulated and real user queries.
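The contrast with a single global embedding can be sketched as follows: instead of collapsing an image into one vector, the search state keeps per-region evidence that each new query round updates. The keyword-overlap scoring below is a hypothetical stand-in for the learned region-aware encoder; region term sets and image names are illustrative.

```python
def update_state(region_states, query_terms):
    """Accumulate per-region evidence from one round of the dialog,
    rather than folding everything into a single global vector."""
    for region in region_states:
        region["score"] += len(region["terms"] & query_terms)
    return region_states

def image_score(region_states):
    """An image is ranked by its best-matching region."""
    return max(r["score"] for r in region_states)

# Two candidate images, each described by two regions (toy data).
candidates = {
    "img1": [{"terms": {"red", "car"}, "score": 0},
             {"terms": {"tree"}, "score": 0}],
    "img2": [{"terms": {"blue", "car"}, "score": 0},
             {"terms": {"dog"}, "score": 0}],
}

# Two rounds of user queries refine the same persistent state.
for query in [{"car"}, {"red"}]:
    for regions in candidates.values():
        update_state(regions, query)

best = max(candidates, key=lambda name: image_score(candidates[name]))
```

After the first round ("car") both images tie; the second round ("red") resolves the ambiguity only because evidence was kept at the region level, which is the failure mode of purely global embeddings noted above.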
In the third project, we focus on learning the visual relation of an image pair in the context of reranking image search results for instance-level image recognition. In particular, we propose a lightweight and straightforward pipeline that learns to predict the similarity of an image pair directly. The key ingredient of this work is a transformer-based architecture that models the interactions between the global/local descriptors within each individual image and across the image pair. Our experiments show that the proposed method outperforms previous approaches while using far fewer local descriptors. It can also be jointly optimized with the feature extractor, leading to further accuracy improvement.
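The core idea of modeling within-image and cross-image descriptor interactions can be sketched with a single self-attention pass over the concatenated descriptors of both images. This is a schematic in pure Python under stated assumptions: the pooling and the norm-based score stand in for the learned classification head, and real descriptors would be high-dimensional learned features.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                      # numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(d)])
    return out

def pair_similarity(desc_a, desc_b):
    """Jointly attend over the concatenated descriptors of both images,
    so interactions within each image and across the pair are modeled
    in the same attention operation; then pool to a scalar score."""
    tokens = desc_a + desc_b
    mixed = attention(tokens, tokens, tokens)
    pooled = [sum(col) / len(mixed) for col in zip(*mixed)]
    # Toy score: the pooled vector's norm stands in for a learned head.
    return math.sqrt(sum(x * x for x in pooled))

score = pair_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

Because attention cost grows quadratically with the number of tokens, keeping the descriptor count small matters, which motivates the observation above that the method works with far fewer local descriptors.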
Committee:
- Yanjun Qi, Chair (CS/SEAS/UVA)
- Vicente Ordóñez Román, Advisor (CS/SEAS/UVA)
- Yangfeng Ji (CS/SEAS/UVA)
- Mona Kasra (Drama/GSAS/UVA)
- Ming-Hsuan Yang (UC Merced, Google Research)
- Connelly Barnes (Adobe Research)