Low Latency Invocation in Condor

Project overview

Condor is a distributed batch system for sharing the workload of compute-intensive jobs in a pool of workstations or PCs. In such a system an invocation of a single job is a complicated and relatively long process which overhead can not be neglected compared to the running time of a short job.

There are a lot of factors contributing to job invocation process, among them pre-staging of data and executable, data dependency, resource allocation mechanism complexity and more.

This work is comprised of 2 parts: understanding and profiling Condor job invocation mechanism and designing and implementing a comprehensive solution - “Master-Agents” low latency invocation protocol.

 

Common usage scenario of Condor is to run a big batch of similar jobs which differs only by arguments or input file (for example: Blast application, physical simulations …). Default Condor behavior is to run each job independently from others. On the other hand the overhead of invocation process in Condor is high (including matching, claiming, executable/input transfer, etc…) We are proposing a solution which will take full advantage of jobs' similarities (including similarity of their input demands) and speedup the invocation process (and as a result of that the overall batch run timer).

 

The general idea of Master-Agent based invocation is that instead of submitting to Condor a cluster of a lot of similar jobs, a small number of agent jobs will be submitted. Condor will execute these special agent jobs just the same way it executes the jobs in the cluster. When arriving at execution machine agent can execute original jobs one by one on the execution machine, while optimizing the I/O transfer process.

General flow

User submits his jobs to special daemon – a “Master”. Master submits some number of agents to condor by regular condor_submit command. When agent starts running on execution machine it connects its master and starts to receive jobs to run. Executable and input files that are common to all user jobs are transferred only once. While agent executes one job it can parallely transfer back to master the results of previous job and immediately receive the next one. Special recovery algorithm was develop in case some agent fails (its machine crashes)

Results

Performance analysis has shown major run time decrease of big clusters of short length jobs.

Master-agent approach is extremely efficient on jobs of up to 20 seconds length with big common input files. Run time speedup for such jobs moves between 2 up to 8 times. More detailed results can be found in an attached project description.

Detailed project description

Presentation

Installation procedure

Configuration parameters

Sources

Linux executables

Contact Gabi Kliot or Noam Palatin with any inquiries.