Efficient invocation of short data-intensive jobs


Overview

Invoking a single job in Condor is a complicated and relatively long process, and its overhead cannot be neglected compared to the running time of a short job. Many factors contribute to the invocation overhead, among them pre-staging of the data and the executable, and the complexity of the resource allocation mechanism.

This work consists of two parts: understanding and profiling the Condor job invocation mechanism, and designing and implementing a comprehensive solution that provides a low-overhead, efficient invocation mechanism for very short jobs with high data requirements.

A common usage scenario for Condor is running a large batch of similar jobs that differ only in their arguments or input files (for example, the BLAST application or physical simulations). By default, Condor reuses the matched resource for multiple invocations of the same job, which reduces the matching overhead. The overhead of the invocation process on the matched resource, however, is repeated on every single invocation. This includes staging the executable and input, job cleanup, and staging the output back from the resource to the submission machine.
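The repeated overhead described above can be illustrated with a rough cost model. This is only a sketch: the function names, the parameters, and the split of staging into a "common" and a "private" part are assumptions made for illustration, it models a single matched resource running jobs sequentially, and the values passed in are placeholders rather than measurements.

```python
def per_invocation_total(n_jobs, match, stage_common, stage_private, run, cleanup):
    """Total time when every invocation repeats the full staging.

    Models the default behavior described above: the match is made once
    and reused, but all staging and cleanup is paid for on every job.
    All times are in seconds and purely hypothetical.
    """
    return match + n_jobs * (stage_common + stage_private + run + cleanup)


def shared_staging_total(n_jobs, match, stage_common, stage_private, run, cleanup):
    """Total time when the common executable and input are staged only once.

    Only the per-job private input, the run itself, and the cleanup
    remain inside the per-job loop.
    """
    return match + stage_common + n_jobs * (stage_private + run + cleanup)


def estimated_speedup(n_jobs, match, stage_common, stage_private, run, cleanup):
    """Ratio of the two totals for the same (hypothetical) parameters."""
    args = (n_jobs, match, stage_common, stage_private, run, cleanup)
    return per_invocation_total(*args) / shared_staging_total(*args)
```

In this model the benefit grows as the common staging time becomes large relative to the job's running time, which is the case of short jobs with large common input files that this work targets.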

We propose a solution that takes full advantage of the similarities between jobs that use the same executable and common input files in order to speed up the invocation process and, as a result, improve overall performance.


Solution design

The chosen solution implements the Master-Worker paradigm.

The user submits jobs to a special daemon, the "Master". Instead of submitting a cluster of many similar jobs to Condor, the Master submits a small number of agent (worker) jobs using the regular condor_submit command. Condor executes these agent jobs the same way it executes the jobs of any cluster. When an agent starts running on an execution machine, it connects to its Master and starts receiving jobs to run. The executable and the input files that are common to all user jobs are transferred only once. The agent executes the original jobs one by one on the execution machine, and while doing so it can, in parallel, transfer the results of previous jobs back to the Master and immediately receive the next job, which optimizes the I/O transfer process. A special recovery algorithm was developed for the case in which an agent fails (its machine crashes).
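The flow described above can be sketched in a few lines of Python. This is a minimal in-process illustration, not the project's implementation: threads stand in for agents running on execution machines, the class and function names (Master, agent, run_batch) are hypothetical, and the recovery rule shown here (put the failed agent's in-flight job back in the queue) is an assumption, since the actual recovery algorithm is described only in the project description. The overlap of result transfer with execution is not modeled.

```python
import collections
import threading


class Master:
    """Hands out jobs to agents and collects results (hypothetical stand-in)."""

    def __init__(self, job_args, common_input):
        self.common_input = common_input
        self.pending = collections.deque(enumerate(job_args))
        self.in_flight = {}        # job_id -> argument, handed out but not reported
        self.results = {}          # job_id -> result
        self.common_transfers = 0  # how many times the common input was sent
        self.cond = threading.Condition()

    def fetch_common_input(self):
        # Called once per agent, not once per job.
        with self.cond:
            self.common_transfers += 1
            return self.common_input

    def next_job(self):
        # Wait while other agents still hold jobs that might come back on failure.
        with self.cond:
            while not self.pending and self.in_flight:
                self.cond.wait()
            if not self.pending:
                return None  # everything has been reported
            job_id, arg = self.pending.popleft()
            self.in_flight[job_id] = arg
            return job_id, arg

    def report(self, job_id, result):
        with self.cond:
            del self.in_flight[job_id]
            self.results[job_id] = result
            self.cond.notify_all()

    def agent_failed(self, job_id):
        # Assumed recovery rule: return the lost job to the queue.
        with self.cond:
            self.pending.append((job_id, self.in_flight.pop(job_id)))
            self.cond.notify_all()


def agent(master, run_job, crash_on=None):
    """Agent loop: get the common input once, then run jobs one by one."""
    common = master.fetch_common_input()
    while True:
        job = master.next_job()
        if job is None:
            return
        job_id, arg = job
        if job_id == crash_on:
            # Simulated crash; a real Master would have to detect this itself.
            master.agent_failed(job_id)
            return
        master.report(job_id, run_job(common, arg))


def run_batch(job_args, common_input, run_job, n_agents=3, crash_on=None):
    """Run all jobs with n_agents agents; the first agent may simulate a crash."""
    master = Master(job_args, common_input)
    threads = [
        threading.Thread(target=agent,
                         args=(master, run_job, crash_on if i == 0 else None))
        for i in range(n_agents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return master
```

For example, run_batch(range(10), 100, lambda common, x: common + x) runs ten jobs on three agents while the common input is fetched three times, once per agent, instead of ten times. The sketch assumes that at least one agent survives; with a single agent that crashes, the requeued job would never be run.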

A detailed project description and a short presentation can be downloaded from here.


Status

1.10.2003 This project is complete. We implemented an external module that provides the required functionality and carried out a performance analysis, which showed a major decrease in the run time of large clusters of short jobs. The master-agent approach is extremely efficient for jobs of up to 20 seconds in length with large common input files. The run time speedup for such jobs ranges from 2 to 8 times. More detailed results can be found in the attached project description.

Download

Installation procedure

Configuration parameters

Sources

Linux executables (Condor version 6.4.0, linux-glibc2.3, dynamic (tested on RH8)).


Contact

Students

Gabi Kliot - http://www.cs.technion.ac.il/~gabik gabik-at-cs.technion.ac.il

Noam Palatin - noamp-at-tx.technion.ac.il

Supervisors

Mark Silbershtein

Prof. Assaf Schuster