Adding High Availability and Replication to Condor Central Manager
Overview
Condor pool, compirised with execution and submission machines, usually runs
a single Central Manager (CM) or matchmaker, serving as a broker between
resources and jobs, finding the best available resource for a given
job.
CM is the ''brain'' of Condor pool and its proper functioning
is critical for achieving maximum utilization of the available pool
resources. Its failure leads to Condor inability to match new jobs
submitted to the pool, leading to underutilization of the resources.
Furthermore, Condor tools used to obtain information about pool functioning,
such as current status of pool resources and usage statistics will
fail to execute.
We have designed a High Availability Daemon (HAD) to eliminate the
problem of single-point-of-failure of CM in Condor pool. In the new
setting Condor pool may have several CMs, augmented with HAD, which
ensures that only one of them will be active in every given instant.
HAD provides transparent fail-over mechanism by automatically detecting
failure of the working CM and by allowing the pool resources to use
another CM. The important feature of our design is that it is external
to the service being made highly available. That is, no change is
made to the matchmaker daemons, incapsulating all the complexity inside
HAD. Some minor changes are required to be made to other pool components
and user tools.
The main design criteria of HAD are transparency of fail-over for
pool resources and pool users, simplicity of the protocols, low communication
overhead and minimization of changes to the original Condor code.
CM is a statefull service. Therefore, in addition to ensuring that there is always
a single active CM in the pool, this state should not be lost in face of failures
and an active CM must be provided with an updated and a consistent state.
Condor CM's state encapsulates the information about Condor pool resources sharing and
users priorities and is stored in Accountantnew.log file.
This file resides on the machine running the negotiator daemon, and is being
updated once upon some period of time. Loss of this file after negotiator's
machine failure will result in improper handling of pool's fairness policy.
As part of the solution we have designed and developed another daemon to take care of the replication
and data consistency issues.
We have introduced a replication daemon, which runs on each backup
machine. The replication daemons on different machines comprise a replication
subsystem within the pool. The replication daemon, residing on the active HAD
machine, is called the replication leader and CM's state file is transferred by it to other replication daemons,
called replication backups. This daemons redundancy enables us to have an
up-to-date copy of the crucial file upon active CM machine failure on every
backup machine.
Solution design
High Availibility Deamon design
In case of failure of Condor matchmaker our solutions aims to start a new CM.
In case of network partitions new CM will be started in each part of the network.
We identified a number of issues to be addressed in order to provide
high availability to Condor CM.
- Reliable detection of CM failure. Many different types of failures should be handled.
- Failover to backup CM in the presence of such a failure. This involves making the service available again.
- Ensuring that backup CMs have valid and up-to-date state, required for proper functioning.
- Making Condor components become aware of dynamic change of CM. The
challenge is to avoid pool-wide synchronous communications to achieve this.
- Automatic handling of ''split brain'' reconciliation, i.e. conflict
resolution between multiple active CMs. This can occur after the network
connectivity between partitioned parts of the pool is restored. Due
to our requirement to invoke CM for every partition, this would cause
a pool to be managed by more than single CM, and should be avoided.
- Deployment simplicity. This issue is particularly important for large
pools, where configuration and setup of Condor are usually quite complex.
Replication Deamon design
The most important feature of the replication system design is a separation
between a replication mechanism (such as how and when to send a
state replica to backup machines) and a consistency policy (e.g., which replica
is the most update one, which replica is a later version of a previous replica
and which replica can not be consistently compared with other replicas). We
provide both a replication mechanism and an implementation of a default
consistency policy. A flexible and a generic design of a replication daemon
allows an easy development and a transparent deployment of additional
consistency policies. All consistency related decisions are encapsulated in a
special policy module which can be substituted by another module transparently
to the replication mechanism and to other Condor components.
Other highlights of the replication system design include:
- The replication system can be used for replication of any data in
the pool. Although current version is only meant for replication of CM's
state file, it can also be used for replication of the state of other
Condor daemons, e.g., schedd.
- Various instances of replication subsystems, with different
consistency policies, can coexist in the pool.
- Every High Availability Daemon (HAD) has a matching replication
daemon.
- The replication daemon architecture is semi-decoupled from the HAD
architecture, i.e. the change of HAD state affects the change of
replication daemon state, but not the vice versa. This makes HAD robust
to potential replication daemon's failures.
- The implementation of a replication mechanism possesses the
following properties:
- The file transfer is performed by separate processes, called
transferers, which are created when the need for transfer arises. This does not stall the
replication daemon and allows it to serve additional requests even
during massive file
transfer.
- Robustness - dealing with stuck transfereres: the replication daemon
manages the set of transferers, currently running on its machine. If
some transferer's running time exceeds a predefined time, it is
being killed by the local replication daemon.
- File transfer integrity: the files are transferred along with its MAC
(Message Authentication Codes).
- Transactional transfer: in case a state is comprised of a number of
files they are being transferred in transactional way, i.e. if one of
the files failed to pass properly, the entire process of transferring
multiple files is considered a failure (we use it for a transactional
transfer of a state file and a version file).
- The default implementation of a consistency policy, provided
in the current solution, possesses the following properties:
- The replication daemon is implemented as a state machine. In any
given instance of time it can either be a replication leader, a
replication backup or a joining machine. Replication leader is the one
which is running along with the active HAD.
- Every replica of the state file is accompanied by a version number.
This version number is stored in an additional version file, in order to
properly deal with failures and restarts of the replication daemon
itself.
- Only replication leader is allowed to increment the version
number of the state file. When it detects that the last known modification time of the
state file has been changed, it increments
the version number of its replica of the state file and broadcasts it
(along with the version) to all backups in the pool.
- The version numbers are causally ordered. That way we establish a
causal relationship between the state files and enforce the "pick the
latest" version algorithm.
- Support
for the ''split brain'' reconciliation. When a situation of multiple
active CMs in the pool is detected, a giving-up replication daemon will send
a replica of its state to other replication daemons in the pool, in
order to allow an active replication daemon to merge the
files of two previously split networks.
Note that since one of the requirements of the replication system was to
treat the replicated state as a black box, our replication system is not aware
of the actual semantics of the state file. As a result, our default consistency
policy, provided in the current solution, may fail to properly handle the
specific case when the whole pool is turned off, restarted later and a new
replication daemon, that did not run before, is added. In order to establish a causal order of
the state version numbers, at least one replication daemon must be alive at any
given instance of time in the replication subsystem. In case a new replication
daemon, that did not run before, is added in the same event with restarting all
other replication daemons, there is no way to establish such a causal order
between the latest version numbers of this new daemon and previously running
daemons. As a result, the algorithm might fail to choose the latest version of
the state file from previously running daemons and instead choose a version of a
newly added daemon. In such case an active CM will use a more obsolete version
of the state file (the one of the newly added daemon).
Solutions:
- In case of restart of the whole pool and addition of another backup CM, a
system administrator must make sure that newly added replication daemon possesses
the latest version of the state file. This can be easily done by manually
coping the state file from previously running replication leader.
- The whole situation can be avoided when new replication daemons are added
strictly after (couple of minutes after) previously running daemons are
started.
- Another possible solution is by augmenting a consistency policy with some
insights about the state file itself. In such case a proper ordering or
merging of different state files will be possible.
Status
Highly Available Central Manager project has been released as a part of official Condor 6.7.6 development release.
See the
Version History and Release Notes for details on 6.7.6 release.
Replication project project has been released as a part of official Condor 6.7.18 development release and
together with Highly Available Central Manager project is currently part of the stable 6.8 release.
See the
Version History and Release Notes for details on 6.7.18 release.
HA in production
Downloads
Condor
Condor is available from Downloads page.
Condor manual documentation (v 6.7.18)
is available here.
Design Documents
The preliminary HA working draft (outdated) can be found here.
Full design document of the replication system architecture is available here.
Presentations
The presentation "Highly Avaliable Central Manager" presented during
the Condor Week 2005 is available here
(.ppt,
.pdf).
The presentation "Recent Developments in Highly Avaliable Central Manager" presented during
the Condor Week 2006 is available here
(.ppt,
.pdf).
The HPDC-2006 hot topic paper is available here.
Configuration
The configuration tutorial of HA system presented during the Condor Week 2005 is available here
(.ppt,
.pdf).
Sample configuration file for central manager machine running HAD is available here.
Sample configuration file for client (schedd, startd) machine running HAD is available here.
The configuration tutorial of both HA and Replication presented during the Condor Week 2006 is available here
(.ppt,
.pdf).
Sample configuration file for replication machine is available
here.
Configuration sanity check script is available here.
Contact
Gabi Kliot - gabik-at-cs.technion.ac.il
Artyom Sharov - sharov-at-cs.technion.ac.il
Mark Silberstein - marks-at-tx.technion.ac.il
Students
Svetlana Kantorovitch, Dedi Carmeli, Boris Mudrik, Oleg Yudayev
The prototype of this project was done by
Alex Gantman and Yvgeny Chaplinsky
Supervisors
Mark Silberstein, Gabi Kliot
Prof. Assaf Schuster