Adding High Availability and Replication to Condor Central Manager


Overview

Condor pool, compirised with execution and submission machines, usually runs a single Central Manager (CM) or matchmaker, serving as a broker between resources and jobs, finding the best available resource for a given job. CM is the ''brain'' of Condor pool and its proper functioning is critical for achieving maximum utilization of the available pool resources. Its failure leads to Condor inability to match new jobs submitted to the pool, leading to underutilization of the resources. Furthermore, Condor tools used to obtain information about pool functioning, such as current status of pool resources and usage statistics will fail to execute.

We have designed a High Availability Daemon (HAD) to eliminate the problem of single-point-of-failure of CM in Condor pool. In the new setting Condor pool may have several CMs, augmented with HAD, which ensures that only one of them will be active in every given instant. HAD provides transparent fail-over mechanism by automatically detecting failure of the working CM and by allowing the pool resources to use another CM. The important feature of our design is that it is external to the service being made highly available. That is, no change is made to the matchmaker daemons, incapsulating all the complexity inside HAD. Some minor changes are required to be made to other pool components and user tools. The main design criteria of HAD are transparency of fail-over for pool resources and pool users, simplicity of the protocols, low communication overhead and minimization of changes to the original Condor code.

CM is a statefull service. Therefore, in addition to ensuring that there is always a single active CM in the pool, this state should not be lost in face of failures and an active CM must be provided with an updated and a consistent state. Condor CM's state encapsulates the information about Condor pool resources sharing and users priorities and is stored in Accountantnew.log file. This file resides on the machine running the negotiator daemon, and is being updated once upon some period of time. Loss of this file after negotiator's machine failure will result in improper handling of pool's fairness policy.

As part of the solution we have designed and developed another daemon to take care of the replication and data consistency issues. We have introduced a replication daemon, which runs on each backup machine. The replication daemons on different machines comprise a replication subsystem within the pool. The replication daemon, residing on the active HAD machine, is called the replication leader and CM's state file is transferred by it to other replication daemons, called replication backups. This daemons redundancy enables us to have an up-to-date copy of the crucial file upon active CM machine failure on every backup machine.


Solution design

High Availibility Deamon design

In case of failure of Condor matchmaker our solutions aims to start a new CM. In case of network partitions new CM will be started in each part of the network. We identified a number of issues to be addressed in order to provide high availability to Condor CM.

Replication Deamon design

The most important feature of the replication system design is a separation between a replication mechanism (such as how and when to send a state replica to backup machines) and a consistency policy (e.g., which replica is the most update one, which replica is a later version of a previous replica and which replica can not be consistently compared with other replicas). We provide both a replication mechanism and an implementation of a default consistency policy. A flexible and a generic design of a replication daemon allows an easy development and a transparent deployment of additional consistency policies. All consistency related decisions are encapsulated in a special policy module which can be substituted by another module transparently to the replication mechanism and to other Condor components.

Other highlights of the replication system design include:

Note that since one of the requirements of the replication system was to treat the replicated state as a black box, our replication system is not aware of the actual semantics of the state file. As a result, our default consistency policy, provided in the current solution, may fail to properly handle the specific case when the whole pool is turned off, restarted later and a new replication daemon, that did not run before, is added. In order to establish a causal order of the state version numbers, at least one replication daemon must be alive at any given instance of time in the replication subsystem. In case a new replication daemon, that did not run before, is added in the same event with restarting all other replication daemons, there is no way to establish such a causal order between the latest version numbers of this new daemon and previously running daemons. As a result, the algorithm might fail to choose the latest version of the state file from previously running daemons and instead choose a version of a newly added daemon. In such case an active CM will use a more obsolete version of the state file (the one of the newly added daemon).

Solutions:


Status

Highly Available Central Manager project has been released as a part of official Condor 6.7.6 development release. See the Version History and Release Notes for details on 6.7.6 release.

Replication project project has been released as a part of official Condor 6.7.18 development release and together with Highly Available Central Manager project is currently part of the stable 6.8 release. See the Version History and Release Notes for details on 6.7.18 release.

HA in production


Downloads

Condor

Condor is available from Downloads page.

Condor manual documentation (v 6.7.18) is available here.

Design Documents

The preliminary HA working draft (outdated) can be found here.

Full design document of the replication system architecture is available here.

Presentations

The presentation  "Highly Avaliable Central Manager" presented during the Condor Week 2005  is available here (.ppt, .pdf).

The presentation  "Recent Developments in Highly Avaliable Central Manager" presented during the Condor Week 2006  is available here (.ppt, .pdf).

The HPDC-2006 hot topic paper is available here.

Configuration

The configuration tutorial of HA system presented during the Condor Week 2005 is available here (.ppt, .pdf).

Sample configuration file for central manager machine running HAD is available here. Sample configuration file for client (schedd, startd) machine running HAD is available here.

The configuration tutorial of both HA and Replication presented during the Condor Week 2006 is available here (.ppt, .pdf).

Sample configuration file for replication machine is available here.

Configuration sanity check script is available here.


Contact

Gabi Kliot - gabik-at-cs.technion.ac.il
Artyom Sharov - sharov-at-cs.technion.ac.il
Mark Silberstein - marks-at-tx.technion.ac.il

Students

Svetlana Kantorovitch, Dedi Carmeli, Boris Mudrik, Oleg Yudayev

The prototype of this project was done by Alex Gantman and Yvgeny Chaplinsky

Supervisors

Mark Silberstein, Gabi Kliot

Prof. Assaf Schuster