2025-2026 Faculty Research Awards

The Johns Hopkins University + Amazon Initiative for Artificial Intelligence (AI2AI) has selected eight JHU Whiting School of Engineering (WSE) faculty research projects for its AY 2025-2026 AI2AI Faculty Research Awards.

Next-Generation 3D Vision-Language Models with Physics, Counterfactual Reasoning & Planning

Project Summary

Current vision-language models (VLMs) are proficient with static 2D images but struggle with the complexities of 3D and 4D scenes. Their limitations in spatial reasoning, physical understanding (such as predicting interactions), and counterfactual exploration, along with interpretability challenges, restrict their effectiveness in real-world applications. Our research introduces a new generation of VLMs that use explicit 3D and 4D world representations. We aim to develop models that inherently incorporate scene structure, object geometries, physical attributes, and physics-based principles. This integrated approach enables a VLM not only to answer detailed factual queries about dynamic scenes, but also to forecast future states based on physical laws and to explore counterfactual scenarios involving modifications to objects or dynamics.
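
As a loose illustration of what an explicit world representation buys, the sketch below (Python; every name, number, and the constant-velocity physics are our placeholders, not the project's models) stores objects with geometric and physical attributes, forecasts a next state, and answers a counterfactual query by editing an attribute and re-running the forecast.

```python
# Minimal sketch of reasoning over an explicit object-level scene representation.
# The physics here is a trivial constant-velocity step, used only to show how
# factual forecasting and counterfactual edits operate on the same structure.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Object3D:
    name: str
    position: tuple   # (x, y, z) in meters
    velocity: tuple   # (vx, vy, vz) in m/s
    mass_kg: float

def step(obj: Object3D, dt: float = 1.0) -> Object3D:
    """Forecast the object's next state under constant velocity (no forces)."""
    new_pos = tuple(p + v * dt for p, v in zip(obj.position, obj.velocity))
    return replace(obj, position=new_pos)

ball = Object3D("ball", position=(0.0, 0.0, 1.0), velocity=(1.0, 0.0, 0.0), mass_kg=0.5)

# Factual query: where will the ball be after one second?
print(step(ball).position)            # (1.0, 0.0, 1.0)

# Counterfactual query: what if the ball were moving twice as fast?
faster_ball = replace(ball, velocity=(2.0, 0.0, 0.0))
print(step(faster_ball).position)     # (2.0, 0.0, 1.0)
```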

Principal Investigator

Alan Yuille
Bloomberg Distinguished Professor of Cognitive Science and Computer Science

Multimodal Speech Synthesis: Leveraging Computer Vision and LLMs for Expressive Voice Generation

Project Summary

This project introduces novel research directions focused on advancing speech synthesis through multimodal machine learning. By integrating information from multiple modalities, such as text, images, and gestures, into the synthesis process, speech synthesis systems can generate more natural and expressive speech. In this project, we will explore new methods that leverage computer vision and large language models (LLMs) to improve the expressiveness and naturalness of synthesized speech. Our research will also propose novel evaluation methods for assessing synthesized speech in terms of expressiveness and intelligibility.
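
As a rough illustration of the multimodal conditioning idea described above (the embeddings, dimensions, and fusion-by-concatenation are our assumptions, not the project's encoders), the sketch below combines text, image, and gesture features into a single conditioning vector that a synthesizer could consume.

```python
# Sketch of fusing multiple modalities into one conditioning signal for a
# speech synthesizer. Random vectors stand in for real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)

text_embedding = rng.normal(size=256)     # stand-in for an LLM/text encoder output
image_embedding = rng.normal(size=128)    # stand-in for a vision encoder output
gesture_embedding = rng.normal(size=64)   # stand-in for co-speech gesture features

# Simple fusion by concatenation; a learned fusion network could replace this.
conditioning = np.concatenate([text_embedding, image_embedding, gesture_embedding])

# A synthesizer would consume this vector alongside the input text to control
# prosody and expressiveness; here we only report its size.
print(conditioning.shape)  # (448,)
```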

Principal Investigator

Berrak Sisman
Assistant Professor, Department of Electrical and Computer Engineering

Representing a Collection of Large Language Models as a Gaussian Mixture in Data Kernel Perspective Space

Project Summary

Our project introduces Data Kernel Perspective Space (DKPS) as a framework for providing performance guarantees for large language models (LLMs), spanning theory, methods, and practical applications. Our approach is based on a low-dimensional Euclidean representation of the embedded model responses. This representation of the collection of models provides the foundation for mathematical analysis, yielding concrete statistical guarantees for LLM performance on computationally demanding evaluation tasks. Representing the collection as a Gaussian mixture in DKPS applies across a wide range of LLM inference tasks and to a wide variety of black-box models. Most importantly, it accommodates multimodal models via general representation-space fusion methods.
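
The sketch below is a generic illustration of the kind of pipeline the summary describes, not the authors' DKPS construction: each model is summarized by embeddings of its responses to a shared prompt set, the collection is projected into a low-dimensional Euclidean space (PCA stands in here for whatever embedding the framework prescribes), and a Gaussian mixture is fit over the resulting model points. All data and dimensions are placeholders.

```python
# Illustrative sketch: place a collection of black-box models in a
# low-dimensional Euclidean space via their response embeddings, then model
# the point cloud as a Gaussian mixture.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_models, n_prompts, embed_dim = 12, 50, 384

# Placeholder for real data: responses[m, q] would be the embedding of
# model m's answer to prompt q from a shared evaluation prompt set.
responses = rng.normal(size=(n_models, n_prompts, embed_dim))

# Summarize each model by the concatenation of its response embeddings.
model_features = responses.reshape(n_models, -1)

# Low-dimensional Euclidean representation of the model collection.
perspective = PCA(n_components=2).fit_transform(model_features)

# Fit a Gaussian mixture over the model points; components can be read as
# groups of models that behave similarly on the prompt set.
gmm = GaussianMixture(n_components=3, random_state=0).fit(perspective)
print(gmm.predict(perspective))
```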

Principal Investigator

Carey Priebe
Professor, Department of Applied Mathematics and Statistics and Director, Mathematical Institute for Data Science

Multimodal Inference-Scaling

Project Summary

The field of inference-scaling has branched into two distinct research directions. The first focuses on scaling inference in large language models (LLMs), allowing these models to generate “textual thoughts” and engage in iterative reasoning that builds on prior thoughts. The second explores vision-based inference-scaling, where models create “visual thoughts” by simulating how the world might change as a result of specific actions. Although progress has been made in each area, these approaches have developed largely in isolation from one another. Our proposed work aims to bridge this divide by creating a unified framework that integrates both language-based and vision-based inference-scaling.
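
As a purely schematic sketch of the unification being proposed (the stub functions below are our inventions, not the project's models), the snippet interleaves “textual thoughts” from a language step with “visual thoughts” from a world-simulation step inside one reasoning loop.

```python
# Schematic sketch of a single inference-scaling loop that alternates between
# generating a textual thought and simulating a visual thought, with both
# accumulating into a shared reasoning state.
from dataclasses import dataclass, field

@dataclass
class ReasoningState:
    textual_thoughts: list = field(default_factory=list)
    visual_thoughts: list = field(default_factory=list)   # e.g. predicted frames

def text_step(state: ReasoningState) -> str:
    # Stand-in for an LLM producing the next textual thought from prior thoughts.
    return f"thought-{len(state.textual_thoughts)}"

def vision_step(state: ReasoningState) -> str:
    # Stand-in for a world model predicting how the scene would change.
    return f"frame-{len(state.visual_thoughts)}"

def unified_inference(steps: int = 6) -> ReasoningState:
    state = ReasoningState()
    for i in range(steps):
        if i % 2 == 0:
            state.textual_thoughts.append(text_step(state))
        else:
            state.visual_thoughts.append(vision_step(state))
    return state

print(unified_inference())
```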

Principal Investigator

Daniel Khashabi
Assistant Professor, Department of Computer Science

Enhancing Retrieval-Augmented Generation with Multi-Vector Representation

Project Summary

Retrieval-Augmented Generation (RAG) operates by first retrieving relevant documents and then generating text based on the retrieved context. Therefore, efficiently processing this context to augment generation is critical for both speed and quality. RAG architectures augment large language models (LLMs) but encounter significant latency and memory challenges at scale. While recent approaches attempt to reduce these costs through single-token representations of retrieved content, such coarse compression may not preserve the nuanced semantic information necessary for high-quality generation. Our proposal seeks to improve RAG systems by introducing a multi-vector context compression framework that addresses limitations in current single-vector compression methods.
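
The contrast the summary draws can be illustrated generically (this is our sketch, not the proposed framework): compressing a retrieved document's token embeddings into a single pooled vector versus into a small set of segment-pooled vectors that preserve more of the document's internal structure.

```python
# Illustrative comparison of single-vector vs. multi-vector compression of a
# retrieved document's token embeddings. The multi-vector variant keeps k
# pooled vectors per document rather than collapsing everything into one.
import numpy as np

def single_vector_compress(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse all token embeddings (T, d) into one vector (1, d)."""
    return token_embeddings.mean(axis=0, keepdims=True)

def multi_vector_compress(token_embeddings: np.ndarray, k: int = 4) -> np.ndarray:
    """Split the token sequence into k contiguous segments and pool each,
    yielding a (k, d) compressed representation of the document."""
    segments = np.array_split(token_embeddings, k, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])

# Placeholder token embeddings for one retrieved document (T tokens, d dims).
doc = np.random.default_rng(0).normal(size=(512, 768))
print(single_vector_compress(doc).shape)   # (1, 768)
print(multi_vector_compress(doc, k=4).shape)  # (4, 768), passed to the LLM as context
```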

Principal Investigator

Mahsa Yarmohammadi
Assistant Research Scientist, Center for Language and Speech Processing

Reasoning over Information in Tabular Data

Project Summary

A major challenge in artificial intelligence is the integration of symbolic and intuitive reasoning. For example, natural language questions about information stored in structured tabular format require both language understanding, to handle variation in linguistic expression, and computation over structured data. Our project will develop novel benchmarks for such questions and explore several methods to address them, including techniques that have proven effective for other reasoning tasks as well as a novel approach that combines reinforcement learning with code generation. We will develop two test sets and investigate a range of approaches to table question answering.
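
The code-generation route to table question answering can be sketched as follows; the “generated” pandas snippet is hard-coded here for clarity, whereas in the project a model would produce it from the question, and the table and question are invented examples.

```python
# Sketch of answering a natural-language question about a table by executing
# generated code over the structured data (symbolic computation), rather than
# having the model read the table purely as text.
import pandas as pd

table = pd.DataFrame({
    "country": ["France", "Germany", "Spain"],
    "population_millions": [68.2, 84.5, 48.3],
})

question = "Which country has the largest population?"

# What a code-generating model might produce for the question above.
generated_code = "answer = table.loc[table['population_millions'].idxmax(), 'country']"

namespace = {"table": table}
exec(generated_code, namespace)   # run the generated program over the table
print(namespace["answer"])        # -> "Germany"
```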

Principal Investigator

Philipp Koehn
Professor, Department of Computer Science

Long-horizon AI Assistance via Continuous Mental Reasoning from Streaming Multimodal Inputs

Project Summary

With advances in foundation models, there is growing interest in engineering open-ended AI agents capable of reasoning and planning across diverse scenarios, including web-based, OS-based, and embodied tasks. Currently, AI agents driven by large language models (LLMs) or vision-language models (VLMs) focus on executing a single task based on a specific instruction. However, to safely and effectively assist humans, AI agents must observe and interact over extended periods, tracking human mental states and following sequential instructions. This realistic human-AI interaction poses challenges for agents’ reasoning and planning abilities in long-horizon assistance tasks. In our project, we will develop AI agents capable of providing long-horizon assistance grounded in the user’s changing mental states over time. Specifically, we will focus on continuous theory-of-mind inference from streaming multimodal inputs and on mental-state-guided proactive AI assistance and communication.
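
As a toy illustration of continuous mental-state inference (our assumption-laden sketch, not the project's model), the snippet below maintains a Bayesian belief over a user's latent goal and updates it after each newly observed action in a stream; proactive assistance could then be conditioned on the most probable goal.

```python
# Toy Bayesian filter over a user's latent goal, updated as actions stream in.
# The goals, actions, and likelihood table are hypothetical; in practice the
# likelihoods would come from multimodal perception models, not a hand-written table.
import numpy as np

goals = ["make coffee", "wash dishes", "set table"]

likelihood = {                    # P(observed action | goal), hypothetical values
    "open cupboard":  np.array([0.5, 0.2, 0.6]),
    "pick up mug":    np.array([0.8, 0.1, 0.3]),
    "turn on kettle": np.array([0.9, 0.05, 0.05]),
}

belief = np.ones(len(goals)) / len(goals)      # uniform prior over goals
for action in ["open cupboard", "pick up mug", "turn on kettle"]:
    belief = belief * likelihood[action]       # Bayes update for the new observation
    belief = belief / belief.sum()
    print(action, dict(zip(goals, belief.round(2))))
# The agent's assistance and communication can be keyed to the leading goal.
```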

Principal Investigator

Tianmin Shu
Assistant Professor, Department of Computer Science

Precision AI: Enhancing Deep Search for Expert Knowledge Domains with Multi-agent GUI Systems

Project Summary

Large language models (LLMs) have significantly advanced information retrieval through deep search, a process in which autonomous systems retrieve and synthesize information from diverse and heterogeneous data sources. By leveraging their ability to model cross-document relationships, LLMs enable the generation of comprehensive, context-aware responses to complex queries. While these systems have performed well in general-purpose applications, such as online search, significant challenges remain when applying deep search to specialized expert domains, which often involve proprietary and structured data. To address these limitations, we propose a precision AI framework for enhancing deep search in expert domains through a domain-aware, graphical user interface (GUI)-driven multi-agent architecture. Our approach introduces a coordinated system of specialized agents that work together to interpret user queries, retrieve and synthesize information from structured and unstructured sources, navigate graphical interfaces, and generate accurate, context-aware outputs.
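
A schematic of the coordinated-agents idea might look like the stub below; every class and example query is a placeholder we introduce for illustration, standing in for an LLM-based query interpreter, retrievers over proprietary data, a GUI-navigation model, and a synthesis step.

```python
# Schematic coordinator that dispatches specialized (stubbed) agents for an
# expert-domain deep-search query: interpret the query, gather evidence from
# text sources and from a graphical interface, then synthesize an answer.
class QueryInterpreter:
    def run(self, user_query: str) -> str:
        return f"structured query derived from: {user_query}"

class DocumentRetriever:
    def run(self, query: str) -> str:
        return f"passages from unstructured sources matching [{query}]"

class GUINavigator:
    def run(self, query: str) -> str:
        return f"values read from the domain application's interface for [{query}]"

class Synthesizer:
    def run(self, evidence: str) -> str:
        return f"context-aware answer grounded in: {evidence}"

def deep_search(user_query: str) -> str:
    """Coordinator: interpret, retrieve (text + GUI), then synthesize."""
    query = QueryInterpreter().run(user_query)
    evidence = "; ".join([DocumentRetriever().run(query), GUINavigator().run(query)])
    return Synthesizer().run(evidence)

print(deep_search("summarize records for a hypothetical protocol in an internal registry"))
```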

Principal Investigator

Yanxun Xu
Associate Professor, Department of Applied Mathematics and Statistics