Visual Grounding in Large Vision Language Models 

Abstract:

With the growing popularity of Large Language Models, efforts have been made to extend their capabilities to the visual domain. This development paves the way for natural language interactions with images and videos, enabling the creation of visual assistants that can interpret images, provide guidance in unfamiliar environments, and generate detailed descriptions of visual content. In this presentation, I will outline our approach to leveraging existing large vision and language models to build high-fidelity semantic representations of 3D environments and to understand 3D spatial relationships. Lastly, I will discuss how these enhanced visual capabilities can be applied to embodied AI tasks, such as object search, visual question answering, and following navigation instructions.

 

About the Speaker:

Jana Kosecka is a Professor in the Department of Computer Science at George Mason University. Her research areas are Computer Vision, Robotics, and Embodied AI. She focuses on 'seeing' systems engaged in autonomous tasks, the acquisition of static and dynamic models of environments by means of visual sensors, and human-robot-computer interaction. She has over 200 publications in refereed journals and conferences and is a co-author of a monograph titled Invitation to 3D Vision: From Images to Geometric Models. Prof. Kosecka is an Associate Editor in Chief of IEEE Transactions on Pattern Analysis and Machine Intelligence, a former chair of the IEEE RAS Technical Committee of Robot Perception, and an Associate Editor of IEEE Robotics and Automation Letters and the International Journal of Computer Vision. She has held visiting positions at Stanford University, Google, and Nokia Research. Prior to joining George Mason, she was a postdoctoral fellow in the EECS Department at the University of California, Berkeley, affiliated with the Robotics Laboratory and PATH. She earned her Ph.D. in Computer Science from the University of Pennsylvania, Philadelphia. Prof. Kosecka has received the Marr Prize in Computer Vision and a National Science Foundation CAREER Award.