This work is comprised of 2 parts: understanding and profiling Condor job invocation mechanism, and designing and implementing a comprehensive solution to provide low overhead efficient invocation mechanism of very short jobs with high data requirements.
Common usage scenario of Condor is to run a big batch of similar jobs which differ only by arguments or input file (for example: Blast application, physical simulations). Default Condor behavior is to reuse the matched resource for multiple invocation of the same job. This optimization allows reducing the matching overhead. On the other hand the overhead of invocation process on the matched resource is repeated every single invocation. This includes executable/input staging, job cleanup and output staging back from the resource to the submission machine.
We are proposing a solution which will take full advantage of similarities between the jobs using the same executable and common input files, in order to speedup the invocation process (and as a result of that the overall performance)
The chosen solution implemented Master-Worker paradygm.
User submits his jobs to special daemon a "Master". Master, instead of submitting to Condor a cluster of a lot of similar jobs, submits a small number of agent (worker) jobs to condor by regular condor_submit command. Condor will execute these special agent jobs just the same way it executes the jobs in the cluster. When agent starts running on execution machine it connects its master and starts to receive jobs to run. Executable and input files that are common to all user jobs are transferred only once. Agent can execute original jobs one by one on the execution machine and while doing so it can parallely transfer back to master the results of previous jobs and immediately receive the next one (by that optimizing the I/O transfer process). Special recovery algorithm was develop in case some agent fails (its machine crashes).
Detailed project description and a short Presentation could be downloaded from here.
Linux executables (condor version 6.4.0,linux-glibc2.3,dynamic (tested on RH8)).
Gabi Kliot - http://www.cs.technion.ac.il/~gabik gabik-at-cs.technion.ac.il
Noam Palatin - noamp-at-tx.technion.ac.il
Mark Silbershtein
Prof. Assaf Schuster