
Imagine building an autonomous drone, but the only way to see whether it avoids crashing is to send it flying. Or training a medical AI assistant and having to rely on real patients to test if it gives safe advice.
That’s the challenge facing today’s artificial intelligence developers: to evaluate whether an AI system works, you often have to put it into complex, unpredictable environments where mistakes can be costly — or dangerous.
Shangtong Zhang, the Alf Weaver Assistant Professor in the Department of Computer Science at the University of Virginia’s School of Engineering and Applied Science, has earned a National Science Foundation CAREER Award to rethink how AI agents — decision-making programs that act autonomously within an environment — are evaluated and improved. His grant project, “Revolutionizing the Evaluation of AI Agents with Online and Offline Data,” aims to create methods that are safer, faster and more scalable than current approaches while opening the door to broader participation in AI development.
“There’s a growing need to understand what AI agents are truly capable of without having to learn the hard way,” Zhang said.
Today, testing typically means deploying AI systems in real operating environments, often at limited scale, to observe how they behave under actual conditions. That could involve expensive robotic platforms, time-consuming simulations, or human-in-the-loop evaluations in which people must oversee the AI and intervene as needed. In safety-critical applications such as health care, aviation or autonomous navigation, every test carries potential consequences. If an AI fails even in a simulation (say, a virtual drone crashes into an object), it may expose a vulnerability that would be catastrophic in real deployment. Yet not every failure is obvious or immediate, and some flawed systems are still moved forward because the failures seem minor or context-dependent, like a drone’s obstacle detection system struggling in low-light conditions.
These risks aren’t only a safety challenge; they are also a major bottleneck. As AI systems grow more powerful and complex, traditional testing methods can’t match the speed of development or the variety of situations that agents must handle. Progress stalls while researchers wait for test results, review human feedback or reset expensive equipment. Zhang’s research offers a different path: smarter, faster algorithms that estimate performance with far less dependence on real-time, high-risk deployments in costly or sensitive environments. By reducing wait times and limiting the need for constant hands-on testing, his methods let evaluation keep pace with development.
The heart of his work lies in combining the strengths of offline data — such as previously collected test runs or simulations — with carefully chosen online tests that maximize insight. This fusion allows developers to evaluate agents more accurately and efficiently while sidestepping the resource burden of constant real-time testing.
Zhang’s team is advancing three tightly connected research strategies:
- Leverage past data to test new systems more efficiently
Zhang reimagines Monte Carlo methods, a class of statistical tools used to estimate AI performance, by using previously collected offline data to guide new test samples. This approach dramatically reduces the number of expensive or risky evaluations needed to understand how well an AI system is performing overall (a minimal sketch of the idea appears after this list).
- Dive deeper into AI agents’ behavior
Rather than providing only a high-level summary of performance, Zhang’s work in value function learning, an approach that estimates the long-term benefit of taking certain actions in different situations, enables a more comprehensive view of how AI agents behave. This matters because it moves beyond a simple scalar metric (a single-number score like accuracy or total reward) to a deeper understanding of an agent’s decision-making process. That granular insight is essential for building safer and more reliable AI because it reveals the specific conditions under which an agent might fail (see the second sketch after this list).
- Use fewer people to collect useful human feedback
Some AI agents rely on human input, such as ranking responses or judging behavior, to learn what constitutes success. But gathering this feedback can be slow and labor-intensive. Zhang is developing reward modeling techniques that deliver high-quality guidance using fewer human ratings, helping to lower costs and streamline training. These models can learn patterns from a small number of human preferences and generalize them to similar scenarios, enabling agents to improve while still aligning with what people value, such as clarity, safety or helpfulness. This reduces the burden on human testers without sacrificing quality or oversight (the third sketch after this list illustrates the idea).
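To make the first strategy concrete, here is a minimal Python sketch of one standard way offline data can cut the number of online tests: a control-variate (or difference) estimator, in which a predictor fit on logged runs supplies a cheap baseline and a handful of expensive rollouts correct its bias. The environment, the `offline_predictor`, and every number here are assumptions made for illustration, not Zhang’s actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_return(x):
    # Stand-in for an expensive online rollout: the return depends on a
    # start condition x plus noise. Purely hypothetical.
    return 2.0 * x + rng.normal(0.0, 1.0)

def offline_predictor(x):
    # A proxy assumed to have been fit on old logged runs:
    # imperfect, but correlated with the true return.
    return 1.8 * x + 0.1

starts = rng.uniform(0.0, 1.0, size=20)              # only 20 costly online tests
returns = np.array([true_return(x) for x in starts])

plain_mc = returns.mean()                            # plain Monte Carlo estimate

# Control-variate estimate: average the cheap offline prediction over many
# draws, then correct it with the residuals observed in the online tests.
cheap_starts = rng.uniform(0.0, 1.0, size=100_000)
corrected = offline_predictor(cheap_starts).mean() + (
    returns - offline_predictor(starts)
).mean()

print(f"plain MC: {plain_mc:.3f}  offline-assisted: {corrected:.3f}  true mean: 1.000")
```

Because the online residuals vary much less than the raw returns when the predictor is reasonably accurate, the corrected estimate is typically far more precise for the same online budget.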
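For the second strategy, the sketch below runs tabular TD(0), a textbook value-function-learning method, on a classic five-state random walk. The point is the output: instead of one scalar score, the agent’s long-term prospects are estimated state by state, the kind of granular picture described above. The toy environment and hyperparameters are illustrative assumptions, not drawn from Zhang’s work.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy 5-state random walk: start in the middle, move left or right at random.
# Stepping off the right end pays +1; off the left end pays 0; the episode ends.
N_STATES, START = 5, 2
V = np.zeros(N_STATES)        # one value estimate per state, not one global score
alpha, gamma = 0.1, 1.0

for _ in range(5000):
    s = START
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        done = s_next < 0 or s_next >= N_STATES
        reward = 1.0 if s_next >= N_STATES else 0.0
        target = reward if done else reward + gamma * V[s_next]
        V[s] += alpha * (target - V[s])   # TD(0): nudge V[s] toward the target
        if done:
            break
        s = s_next

print(np.round(V, 2))  # approaches [0.17, 0.33, 0.50, 0.67, 0.83]
```

Reading the whole vector, rather than a single average, shows exactly where the agent’s prospects are weakest, which is the kind of failure localization the paragraph above argues for.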
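For the third strategy, here is a sketch of one common formulation of preference-based reward modeling: a Bradley-Terry model fit to pairwise human judgments. With only 30 labels over two made-up features, the fitted weights already point in the direction of the hidden preference, which is the sense in which a small number of ratings can generalize. The features, preference weights and sample sizes are invented for illustration and are not Zhang’s specific techniques.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each candidate response is reduced to two made-up features, e.g.
# [clarity, verbosity]; a hidden human preference weights clarity positively.
true_w = np.array([2.0, -1.0])

def human_judgment():
    a, b = rng.normal(size=2), rng.normal(size=2)
    p_prefer_a = 1.0 / (1.0 + np.exp(-(a - b) @ true_w))   # Bradley-Terry choice
    return (a, b) if rng.random() < p_prefer_a else (b, a)  # (preferred, rejected)

pairs = [human_judgment() for _ in range(30)]  # just 30 human labels

# Fit reward weights by gradient ascent on the Bradley-Terry log-likelihood:
# maximize the sum over pairs of log sigmoid((good - bad) @ w).
w = np.zeros(2)
for _ in range(2000):
    grad = np.zeros(2)
    for good, bad in pairs:
        d = good - bad
        grad += d * (1.0 - 1.0 / (1.0 + np.exp(-d @ w)))  # d * (1 - sigmoid(d@w))
    w += 0.05 * grad / len(pairs)

print("learned weights:", np.round(w, 2), " hidden preference:", true_w)
```

Once fit, the model can score new responses without further human labels, which is how a small set of preferences stretches across many training scenarios.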
“By reducing the need for constant oversight, physical testing and massive datasets, we can make AI development safer, faster and more accessible to a wider range of researchers and organizations,” Zhang said.
These innovations do more than speed up development — they also make the entire process more scalable. Instead of building large teams or deploying complex hardware to test every iteration, developers can reuse data, prioritize high-impact scenarios and integrate meaningful feedback with minimal friction. That opens the door for smaller labs, startups or new academic groups to contribute to AI research without needing huge infrastructure.
In addition to advancing the technical state of the field, Zhang’s CAREER Award project supports educational goals. He plans to integrate these research activities into graduate and undergraduate courses and provide mentorship opportunities that help prepare students for careers at the forefront of AI and machine learning.
Zhang joined UVA in 2022 and leads the Sequential Intelligence Lab. He earned a doctorate from the University of Oxford, a Master of Science from the University of Alberta and a Bachelor of Science from Fudan University.