Low Latency Invocation in Condor

Project overview
Condor is a distributed batch system for sharing the workload of
compute-intensive jobs in a pool of workstations or PCs. In such a system an
invocation of a single job is a complicated and relatively long process which
overhead can not be neglected compared to the running time of a short job.
There are a lot of factors contributing to job invocation process, among
them pre-staging of data and executable, data dependency, resource allocation
mechanism complexity and more.
This work is comprised of 2 parts: understanding and profiling Condor
job invocation mechanism and designing and implementing a comprehensive solution
- “Master-Agents” low latency invocation protocol.
Common usage scenario of Condor is to run a big batch of similar jobs which differs only by arguments or input file (for example: Blast application, physical simulations …). Default Condor behavior is to run each job independently from others. On the other hand the overhead of invocation process in Condor is high (including matching, claiming, executable/input transfer, etc…) We are proposing a solution which will take full advantage of jobs' similarities (including similarity of their input demands) and speedup the invocation process (and as a result of that the overall batch run timer).
The general idea of Master-Agent based invocation is that instead of submitting to Condor a cluster of a lot of similar jobs, a small number of agent jobs will be submitted. Condor will execute these special agent jobs just the same way it executes the jobs in the cluster. When arriving at execution machine agent can execute original jobs one by one on the execution machine, while optimizing the I/O transfer process.
General flow
User submits his jobs to special daemon – a “Master”. Master submits some number of agents to condor by regular condor_submit command. When agent starts running on execution machine it connects its master and starts to receive jobs to run. Executable and input files that are common to all user jobs are transferred only once. While agent executes one job it can parallely transfer back to master the results of previous job and immediately receive the next one. Special recovery algorithm was develop in case some agent fails (its machine crashes)
Results
Performance analysis has shown major run time decrease of big clusters of short length jobs.
Master-agent approach is extremely efficient on jobs of up to 20 seconds length with big common input files. Run time speedup for such jobs moves between 2 up to 8 times. More detailed results can be found in an attached project description.
Contact Gabi Kliot or Noam Palatin with any inquiries.