Technical Report CS-2006-09

TR#: CS-2006-09
Class: CS
Title: Mining for Misconfigured Machines in Grid Systems
Authors: Noam Palatin, Assaf Schuster, and Ran Wolff
PDF: CS-2006-09.pdf
Abstract: Grid systems are proving increasingly useful for managing the batch computing jobs of organizations. One well-known example is Intel, which uses an internally developed system called NetBatch to manage tens of thousands of machines. The size, heterogeneity, and complexity of grid systems are extreme, which makes these systems very difficult to configure. As a result, some of the machines are often inadequately set up. Such misconfigured machines can have adverse effects on the entire system.

We investigate a distributed data mining approach for detecting misconfigured machines. Our Grid Monitoring System (GMS) non-intrusively collects data from all sources (log files, system services, etc.) available throughout the grid system. It converts the raw data to data with ontological meaning and stores the result on the machine it was obtained from, thus limiting the incurred overhead and allowing scalability. Then, when analysis is requested, a distributed outlier detection algorithm is employed to identify misconfigured machines. The algorithm itself is implemented as a recursive workflow of grid jobs. It is especially suited to the conditions typical of grid systems, in which one can expect machines to rarely be available and to often fail altogether.
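
To make the idea of outlier-based detection concrete, the following is a minimal sketch and not the algorithm of the report. It assumes that each machine's data has already been reduced to a numeric feature vector, that a machine which cannot be reached reports None, and that outliers are ranked by the distance to their k-th nearest neighbor. All names (collect, rank_outliers, the parameters k and top_n) are hypothetical. Unlike the distributed algorithm described above, this sketch gathers the samples in one place and only mimics the recursive structure in the collection step.

```python
import math


def collect(machines):
    """Recursively gather samples from a list of (name, vector) pairs.

    Machines whose vector is None (assumed to mean "unavailable or failed")
    are skipped, so a missing machine does not break the analysis.
    """
    if len(machines) <= 1:
        return {name: vec for name, vec in machines if vec is not None}
    mid = len(machines) // 2
    merged = collect(machines[:mid])
    merged.update(collect(machines[mid:]))
    return merged


def kth_neighbor_distance(point, others, k):
    """Distance from `point` to its k-th nearest neighbor among `others`."""
    distances = sorted(math.dist(point, other) for other in others)
    if not distances:
        return 0.0  # a lone sample has no neighbors to compare against
    return distances[min(k, len(distances)) - 1]


def rank_outliers(samples, k=2, top_n=4):
    """Return the names of the `top_n` machines with the largest k-NN distance."""
    scores = {}
    for name, point in samples.items():
        others = [p for other, p in samples.items() if other != name]
        scores[name] = kth_neighbor_distance(point, others, k)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

For example, given three machines with nearly identical vectors, one unavailable machine, and one machine whose vector lies far from the rest, rank_outliers(collect(machines), k=2, top_n=1) returns only the name of the distant machine.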

We demonstrate the benefit of our distributed data mining approach by using GMS to analyze data from a large Condor pool. Of the four most outlying computers identified by the system, three were indeed misconfigured, and one apparently had a temporary problem that we could not recreate. Further investigation shows that our approach is highly scalable and suitable for large grid systems in which every pool may have thousands of computers.

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2006/CS/CS-2006-09), rather than to the URL of the PDF or PS files directly. The latter URLs may change without notice.


Computer science department, Technion