For a project I am experimenting with, I wanted to run multiple instances of a program in parallel, each instance dealing with different input data. An example to demonstrate the use case: let's say I have a folder full of movies in wmv format, all of which I want to convert to mp4 format using the ffmpeg program. I would like to run as many of these conversion jobs in parallel as possible, possibly on multiple computers.
Given my current exposure to big data solutions like Hadoop/Spark, I initially thought that we could, for example, use the YARN scheduler and run these jobs on a Spark cluster. But my research indicated that these use cases are not really served by YARN/Spark. These are not cloud computing tasks; they are grid computing tasks.
Now, given my graduate days at the University of Wisconsin-Madison, for me grid computing means Condor (currently called HTCondor). I used Condor for my Ph.D. research, and it was awesome. So I took a look and gave it a try. Since I had never set up a cluster before, here are some notes for setting up a cluster, aka grid, for these jobs.
Note that this is not really a Getting Started guide; it doesn't show, for example, sample jobs or sample submission files. For that purpose, follow the manual.
Spark vs HTCondor
Let's start with a short discussion of when it is appropriate to use a system like Spark versus a system like HTCondor. As mentioned before, Spark is an engine for cloud computing and HTCondor is a system for grid computing. In my mind, the difference comes down to the following:
- HTCondor lets me run multiple instances of the same program on one or more nodes, each instance working on different inputs. Spark lets me parallelize one computation/algorithm across multiple nodes, starting from one input.
- HTCondor is appropriate when I need to run the same program on many (somewhat small) input files, possibly accessed from a shared drive. Spark is appropriate when the input is a massive dataset and the computation fits the map-reduce paradigm.
With that out of the way, let's proceed with HTCondor.
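As a concrete sketch of the fan-out use case from the top of the post, here is a small shell script that generates one HTCondor submit description with one queued conversion job per wmv file. The folder name, placeholder files, and the /usr/bin/ffmpeg path are assumptions for illustration, not from the manual:

```shell
#!/bin/sh
# Sketch: fan out one ffmpeg conversion job per .wmv file via HTCondor.
# The demo_movies folder, its files, and /usr/bin/ffmpeg are assumptions.
mkdir -p demo_movies
touch demo_movies/a.wmv demo_movies/b.wmv   # placeholder inputs for the demo

{
  echo "universe = vanilla"
  echo "executable = /usr/bin/ffmpeg"
  echo "transfer_executable = false"
  for f in demo_movies/*.wmv; do
    base=$(basename "$f" .wmv)
    echo "arguments = \"-i $f demo_movies/$base.mp4\""
    echo "queue"
  done
} > convert.sub

echo "Generated convert.sub with $(grep -c '^queue$' convert.sub) jobs"
```

Submitting would then be a single condor_submit of convert.sub, letting the matchmaker spread the conversions across whatever slots are available.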
Machines Setup
I am currently using a cluster of 3 VMs, each running CentOS 6, with v8.6.0 of HTCondor; the manual is available here.
HTCondor Setup
Here are the commands that install HTCondor on each of these CentOS 6 systems, to be run as the root user. The original instructions can be found here.
cd /etc/yum.repos.d/
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
rpm --import RPM-GPG-KEY-HTCondor
yum install condor-all
Notes:
- CentOS <version> is equivalent to RHEL <version>
- You might have to clear the yum cache using 'yum clean all'
- Once the installation is successful, the binaries go into /usr/bin (/usr becomes the "installation folder"), and the config files go into /etc/condor
- As part of the installation process, a user and group named 'condor' are created.
Here are the corresponding commands that install HTCondor on Debian systems (I ended up setting this up on my VirtualBox VM for experimenting while on the go, where I run a Debian distro: Bunsen Labs Linux). Again, to be run as the root user. The original instructions can be found here.
echo "deb http://research.cs.wisc.edu/htcondor/debian/stable/ jessie contrib" >> /etc/apt/sources.list
wget -qO - http://research.cs.wisc.edu/htcondor/debian/HTCondor-Release.gpg.key | sudo apt-key add -
apt-get update
apt-get install -t jessie condor
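Once the package is installed on either distro, a quick sanity check is to run the following commands on the box and confirm they report sensible values:

```shell
condor_version                  # prints the installed HTCondor version
condor_config_val RELEASE_DIR   # prints the installation folder
```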
HTCondor Configuration
- Configuration settings live in condor_config (and condor_config.local) in /etc/condor
- One of the most important gotchas of configuring HTCondor is whether reverse DNS lookup works properly on your hosts, meaning 'nslookup <your ip address>' returns your full host name. If it does, you will save yourself a lot of trouble. Otherwise, you have to add additional configs as required.
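A quick way to check this on each host (the IP address below is a placeholder for the host's own address):

```shell
hostname -f              # prints the full host name
nslookup 192.168.1.10    # reverse lookup of this host's own IP
# The reverse lookup should come back with the same full host name.
```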
- I changed the following configurations (the first two because of the above reason):
- For all boxes in the grid, including CM (Central Manager):
- condor_config: ALLOW_WRITE = *
- For all boxes in the grid, including CM (Central Manager), if you want to use it for computations:
- condor_config.local: TRUST_UID_DOMAIN = true
- For boxes other than the CM, point to the correct CM IP address, or CM host name if reverse DNS is working.
- condor_config: CONDOR_HOST = x.x.x.x
- For Debian executor boxes, blank out the BASE_CGROUP variable. More notes here.
- condor_config.local: BASE_CGROUP =
- If you do not want the CM box to run jobs, do not run the STARTD daemon. More information here.
- condor_config.local: DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
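Putting these settings together, a central manager that should not run jobs, and a Debian executor box, might end up with condor_config.local files roughly like the following sketches (the CM address is a placeholder, and I am folding the ALLOW_WRITE setting into the local file for brevity):

```text
# condor_config.local on the CM (no jobs run here)
ALLOW_WRITE = *
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

# condor_config.local on a Debian executor box
ALLOW_WRITE = *
CONDOR_HOST = 192.168.1.10
TRUST_UID_DOMAIN = true
BASE_CGROUP =
```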
Running HTCondor on a Single Node
We will start with running HTCondor on a single node, which will eventually become our central manager.
- Start the condor daemons:
- On CentOS 6: /sbin/service condor start
- On Debian jessie: service condor start OR /etc/init.d/condor start
- Check if the correct processes are running.
[root@... condor]# ps -elf | egrep condor_
5 S condor 4200 1 0 80 0 - 11653 poll_s 14:53 ? 00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
4 S root 4241 4200 0 80 0 - 6379 poll_s 14:53 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 498
4 S condor 4242 4200 0 80 0 - 11513 poll_s 14:53 ? 00:00:00 condor_shared_port -f
4 S condor 4243 4200 0 80 0 - 16380 poll_s 14:53 ? 00:00:00 condor_collector -f
4 S condor 4244 4200 0 80 0 - 11639 poll_s 14:53 ? 00:00:00 condor_negotiator -f
4 S condor 4245 4200 0 80 0 - 16646 poll_s 14:53 ? 00:00:00 condor_schedd -f
4 S condor 4246 4200 0 80 0 - 11760 poll_s 14:53 ? 00:00:00 condor_startd -f
0 S root 4297 4098 0 80 0 - 25254 pipe_w 14:54 pts/0 00:00:00 egrep condor_
- Check if condor_status returns correctly, with an output similar to this.
[root@... condor]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTim
slot1@... LINUX X86_64 Unclaimed Idle 0.080 7975 0+00:00:0
slot2@... LINUX X86_64 Unclaimed Idle 0.000 7975 0+00:00:2
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 2 0 0 2 0 0 0
Total 2 0 0 2 0 0 0
Notes:
- If condor_status does not return anything, it is probably because reverse DNS lookup is not working on the system. Follow the details in this thread, and start by setting the ALLOW_WRITE variable in the condor_config file to *.
- My CentOS 6 VM is dual-core with 16GB of RAM, so HTCondor automatically breaks this up into 2 slots, each with 1 core/8GB RAM.
I experimented directly with a java universe job, but a vanilla universe job can be tried as well. Once the daemons are running and condor_status returns correctly, switch back to your user login and submit the job using 'condor_submit <job props file>'. If the job takes a bit of time to run, condor_status will show it.
[root@... condor]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTim
slot1@... LINUX X86_64 Claimed Busy 0.000 7975 0+00:00:0
slot2@... LINUX X86_64 Unclaimed Idle 0.000 7975 0+00:00:2
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 2 0 1 1 0 0 0
Total 2 0 1 1 0 0 0
The command condor_q would show it too.
[root@... condor]# condor_q
-- Schedd: ... : <x.x.x.x:42646> @ 02/24/17 14:23:43
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
samik.r CMD: java 2/24 07:38 _ 1 _ 1 2.0
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Troubleshooting/Notes:
- Because of the reverse DNS problem, when I initially submitted the job it was put on hold, and condor_status and condor_q were showing as much. Here are the steps I followed for debugging:
- I followed the instructions available here, specifically using the 'condor_q -analyze <job id>' command.
- I looked at the log file StarterLog.slot1
- Another time, I got a ClassNotFoundException, since the jar file was not available at the proper place. This error showed up in the job's output folder, in the error file (specified in the condor submission file).
Running HTCondor on multiple nodes
Once things work on a single node, it is easy to add more nodes to the cluster: just install the HTCondor package, change the configurations as described, and start the service.
Notes for running jobs
Some points that were not very obvious to me even after reading the instruction manual:
- In vanilla universe jobs, if the executable reads from standard input, use the input command; otherwise, use the arguments command. Consider the following snippet:
executable = testprog
input = input_file
For this snippet, condor tries to execute something like "testprog < input_file". If this is how testprog is supposed to run, great; otherwise, this will not work. On the other hand, if what you really want is "testprog input_file", then the following snippet should work:
executable = testprog
should_transfer_files = yes
# Define input file
my_input_file = input_file
transfer_input_files = $(my_input_file)
arguments = "<other args if required> $(my_input_file)"
Note that this possibly works a little bit differently in the java universe, where the input file name shows up as an argument to the executable (java).
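For reference, a java universe submission might look roughly like the sketch below. The class and file names are made up, and the exact conventions (the executable being the main class file, with the main class name as the first entry in arguments) should be checked against the manual's java universe chapter:

```text
universe = java
executable = Converter.class
arguments = Converter input_file
should_transfer_files = yes
transfer_input_files = input_file
queue
```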
- For executables that are supposed to be available at the exact same path on all nodes in the cluster, it is recommended to specify the full path and switch off transferring the executable:
executable = /path/to/executable
transfer_executable = false
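Tying these pieces together, a complete minimal vanilla universe submit file might look like the following sketch (the program and file names are illustrative):

```text
universe = vanilla
executable = /path/to/testprog
transfer_executable = false
should_transfer_files = yes
transfer_input_files = input_file
arguments = "input_file"
output = job.out
error = job.err
log = job.log
queue
```

Submitting is then a single condor_submit of this file, with progress visible in condor_q and condor_status as shown earlier.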
Update (March 9, 2017): Added Debian Jessie related commands, and formatting changes.
Update (April 15, 2017): Added more Debian notes.
Update (June 3, 2018): Added notes about CM.
Update (Dec 05, 2019): Added comparison between HTCondor and Spark.