Installing replication features in DSL pool
-------------------------------------------
1. Check out the latest version from the V6_7-Technion_ha5-branch branch in CVS.
2. Configure and compile it, then put the binaries inside
   /usr/local/condor/glibc23/condor-6.7.14
3. On both the ds-ibm1 and picasso machines:
   * In /etc/init.d/condor comment out the line
       CONDOR_VERSION=6.7.13
     and add the line
       CONDOR_VERSION=6.7.14
   * In /etc/profile.d/condor.sh comment out the line
       CONDOR_VER=6.7.13
     and add the line
       CONDOR_VER=6.7.14
   * In /etc/profile.d/condor.csh comment out the line
       set CONDOR_VER="6.7.13"
     and add the line
       set CONDOR_VER="6.7.14"
   These changes are needed by the startup scripts and by the sysadmins'
   various profile definitions.
4. In the /usr/local/condor/etc/{ds-ibm1,picasso}.local files on ds-ibm1,
   add the following lines at the end of each file:
       RELEASE_DIR = /usr/local/condor/glibc23/condor-6.7.14
       REPLICATION = $(SBIN)/condor_replication
       REPLICATION_ARGS = -p 61450
       HAD_USE_REPLICATION = true
       MASTER_NEGOTIATOR_CONTROLLER = HAD
       REPLICATION_LIST=picasso.cs.technion.ac.il:61450,ds-ibm1.cs.technion.ac.il:61450
       NEGOTIATOR_STATE_FILE=$(SPOOL)/Accountantnew.log
       REPLICATION_INTERVAL=25
       HAD_ALIVE_TOLERANCE=150
       MAX_TRANSFER_LIFETIME=10
       NEWLY_JOINED_WAITING_VERSION_INTERVAL=5
       HAD_UPDATE_INTERVAL=300
       REPLICATION_LOG = $(LOG)/ReplicationLog
       TRANSFERER_LOG = $(LOG)/TransfererLog
       MAX_REPLICATION_LOG = 64000000
       MAX_TRANSFERER_LOG = 64000000
       REPLICATION_DEBUG = D_FULLDEBUG
       TRANSFERER_DEBUG = D_FULLDEBUG
       DC_DAEMON_LIST = MASTER, STARTD, SCHEDD, KBDD, COLLECTOR, NEGOTIATOR, EVENTD, HAD, REPLICATION
       DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

Problems during installation
----------------------------
There was a problem creating a transferer process.
The error that appeared in the replication daemon log was as follows:
   * Create_Process: child failed because PRIV_UNKNOWN process was still root before exec()
The problem appeared because Condor does not allow a process to be created
by a user whose real uid is root's (i.e. Condor checks the result of the
'getuid' function rather than that of 'geteuid'). This is exactly what
happened in the DSL pool: the Condor processes ran under the effective user
'condor', whereas the real user was 'root'.

Solution
--------
1. Add an NIS group "condor" with GID='id -n condor'
2. Change condor's default group to "condor".
3. Rebuild the NIS database.
4. chown -R condor:condor /home/condor
5. Change the condor init script to run the Condor daemons as the 'condor'
   user:
       su - condor -c /path/to/condor_master
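
The real/effective uid mismatch described under "Problems during installation" can be checked from a shell before starting the daemons. The following is a minimal sketch only: check_ids is a hypothetical helper that is not part of Condor, and it merely flags the combination reported above (real uid 0, effective uid non-zero). It does not reproduce Condor's own privilege logic.

```shell
#!/bin/sh
# check_ids REAL_UID EFFECTIVE_UID
# Hypothetical helper (not part of Condor): flags the situation from the
# notes above, where the real uid is root but the effective uid is not.
check_ids() {
    real_uid=$1
    eff_uid=$2
    if [ "$real_uid" -eq 0 ] && [ "$eff_uid" -ne 0 ]; then
        echo "real uid is root, effective uid is $eff_uid: child creation may fail"
        return 1
    fi
    echo "real uid $real_uid, effective uid $eff_uid: ok"
    return 0
}

# 'id -r -u' prints the real uid and 'id -u' the effective uid of this shell.
# '|| true' keeps the script going even when the mismatch is detected.
check_ids "$(id -r -u)" "$(id -u)" || true
```

Run as the 'condor' user after step 5 (for example through su - condor), both values should be condor's uid and the helper should report "ok".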