By The Office of Communications at the UVA School of Engineering and Applied Science
Chang Lou, assistant professor of computer science. Photo by Tom Cogill.

Cloud systems are the foundation of modern life, powering everything from businesses to daily online activities. Keeping them reliable is critical, but traditional approaches to managing cloud failures are struggling to keep up with new types of problems. 

Addressing Hidden Failures

One key challenge is "silent semantic failures," where a system behaves incorrectly but doesn’t send out any error alerts. These failures often slip past monitoring systems unnoticed, yet they can cause significant disruptions such as widespread e-commerce outages, interrupted financial transactions, or compromised healthcare systems reliant on cloud-based data. The resulting financial losses can reach billions of dollars, underscoring the urgent need for more effective solutions.

Chang Lou, an assistant professor in the Department of Computer Science at the University of Virginia, is tackling the challenge with support from his new NSF CAREER grant. His innovative approach reimagines how cloud systems detect, diagnose and prevent failures by integrating automated tools directly into the development process. By simplifying and automating the creation of failure-checking tools, Lou’s work could make cloud systems more resilient and dependable, even in the face of complex and emerging challenges.

“Cloud systems are the backbone of everything we rely on in the digital age, from financial markets to healthcare,” Lou said. “The stakes are incredibly high, and with this NSF CAREER grant, we’re focused on building a framework that not only detects hidden failures but also provides assurance on correct execution. It also ensures issues are fixed quickly and prevented from happening again. Our ultimate goal is to give developers powerful tools to safeguard the systems we all depend on.”

Closing the Gaps in Cloud Failure Detection and Prevention

Currently, cloud systems mostly rely on statistical methods, such as analyzing logs and resource usage, to spot issues. However, these methods often miss silent failures, highlighting the need for a new way to monitor and verify systems in real time. Existing solutions for runtime checking require developers to manually write detailed failure-checking tools, a slow and error-prone process that is seldom practical for large-scale systems. As a result, these tools are rarely used in real-world systems, putting critical infrastructure and economies at risk.
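
To make the idea concrete, here is a minimal illustrative sketch of what a hand-written runtime check might look like. It is not taken from Lou's project: the `ToyStore` class, its bug, and the `check_write_read` function are all hypothetical, invented only to show how a system can report success while behaving incorrectly, and how a semantic check can catch what error logs would miss.

```python
class ToyStore:
    """Hypothetical key-value store with a silent bug: writes to keys
    longer than 8 characters are acknowledged but never persisted."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if len(key) <= 8:
            self._data[key] = value
        # Always reports success: no exception, no error log.
        return True

    def get(self, key):
        return self._data.get(key)


def check_write_read(store, key, value):
    """Hand-written semantic check (an assumption for illustration):
    a value acknowledged as written must be readable afterwards."""
    acknowledged = store.put(key, value)
    return acknowledged and store.get(key) == value


store = ToyStore()
short_ok = check_write_read(store, "user:1", "alice")
long_ok = check_write_read(store, "session:12345", "token")
print(short_ok, long_ok)
```

In this toy case the second check fails even though the store never raised an error. Writing checks like this by hand for every behavior of a large system is the burden the article describes.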

Lou’s research aims to solve this by detecting hidden failures early, helping developers pinpoint their root causes, testing fixes in a safe environment to avoid disruptions, and learning from past failures to prevent future problems. These steps work together to ensure cloud systems stay reliable, minimize risks and reduce costly downtime.

The broader goal is to make cloud systems far more reliable. The researchers will collaborate with industry leaders like Microsoft and Amazon to test and refine these methods in real-world settings.

About the Grant

Lou’s research is supported by the National Science Foundation through a prestigious CAREER award, granted by the Division of Computer and Network Systems within the Directorate for Computer and Information Science and Engineering. The grant totals $713,338 over five years.