Technical Report MSC-2013-18

Title: Unsupervised Anomaly Detection in Large Datacenters
Authors: Moshe Gabel
Supervisors: Assaf Schuster, Ran Gilad-Bachrach
Abstract: Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Complex online services run on top of datacenters that often contain thousands of machines. With so many machines, failures are common, and automatic monitoring is essential.

Many existing failure detection techniques do not adapt well to the unpredictable and dynamic environment of large-scale online services. They rely on static rules, obsolete historical logs or costly (often unavailable) training data. More flexible techniques are impractical, as they require deep domain knowledge, unavailable console logs, or intrusive service modifications.

We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments on large real-world services, in which over 20% of machine failures were preceded by such latent faults.

We propose a proactive approach to failure prevention by detecting performance anomalies without prior knowledge about the monitored service. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. The main assumption in our framework is that at any point in time, most machines function well. By comparing machines to each other, we can then find those machines that exhibit latent faults.
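
The peer-comparison idea can be illustrated with a minimal sketch. This is not one of the report's three detection methods, which the abstract does not describe; it only shows how machines might be compared to each other under the assumption that most of them are healthy. The function name flag_outlier_machines, the median/MAD deviation score, and the threshold value are all hypothetical choices made for illustration.

```python
from statistics import median


def flag_outlier_machines(counters, threshold=3.5):
    """Flag machines whose counters deviate strongly from their peers.

    counters: dict mapping machine name -> list of counter values,
              with every machine reporting the same counters in the
              same order (an assumption of this sketch).
    threshold: hypothetical cutoff on the robust deviation score.
    """
    machines = list(counters)
    n_counters = len(counters[machines[0]])
    # Worst (largest) robust deviation seen for each machine.
    scores = {m: 0.0 for m in machines}

    for j in range(n_counters):
        column = [counters[m][j] for m in machines]
        # The median represents "typical" behaviour, since most
        # machines are assumed to function well.
        med = median(column)
        # Median absolute deviation: a spread estimate that a few
        # faulty machines cannot easily distort.
        mad = median([abs(v - med) for v in column])
        if mad == 0:
            continue  # no spread in this counter, nothing to compare
        for m in machines:
            deviation = abs(counters[m][j] - med) / mad
            scores[m] = max(scores[m], deviation)

    return sorted(m for m in machines if scores[m] > threshold)


# Made-up snapshot: two counters (say, CPU load and error ratio)
# for five machines at a single point in time.
snapshot = {
    "m1": [10.0, 0.50],
    "m2": [11.0, 0.52],
    "m3": [9.0, 0.49],
    "m4": [10.5, 0.51],
    "m5": [30.0, 0.95],
}
suspects = flag_outlier_machines(snapshot)
```

The sketch uses the median and the median absolute deviation rather than the mean and standard deviation so that the few faulty machines have little influence on the estimate of normal behaviour. It carries no guarantee on false positive rates; the guarantees mentioned in the abstract apply to the report's own tests.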

We demonstrate three detection methods within the framework, and apply them to several real-world production services. The derived tests are domain-independent and unsupervised, require neither background information nor parameter tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests, and show how they hold in practice.

Copyright: The above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

Computer science department, Technion