For a project I am experimenting with, I wanted to run multiple instances of a program in parallel, each instance dealing with different input data. An example to demonstrate the use case: let's say I have a folder full of movies in wmv format, all of which I want to convert to mp4 format using the ffmpeg program. I would like to run as many of these conversion jobs in parallel as possible, possibly on multiple computers.
Given my current exposure to big data solutions like Hadoop/Spark, I initially thought that we could, for example, use the YARN scheduler and run these jobs on a Spark cluster. But my research indicated that these use cases are not really served by YARN/Spark. These are not cloud computing tasks; they are grid computing tasks.
Now, given my graduate days at the University of Wisconsin-Madison, for me grid computing means Condor (currently called HTCondor). I used Condor for my Ph.D. research, and it was awesome. So I took a look and gave it a try. Since I had never set up a cluster before, here are some notes for setting up a cluster, aka grid, for these jobs.
Note that this is not really a Getting Started guide; it doesn't show, for example, sample jobs or sample submission files. For that purpose, follow the manual.
Spark vs HTCondor
Let's start with a short discussion of when it is appropriate to use a system like Spark versus a system like HTCondor. As mentioned before, Spark is an engine for cloud computing and HTCondor is a system for grid computing. In my mind, the difference comes down to the following:
- HTCondor lets me run multiple instances of the same program on one or more nodes, each instance working on different inputs. Spark lets me parallelize one computation/algorithm across multiple nodes, starting from one input.
- HTCondor is appropriate when I need to run the same program on many (somewhat small) input files, possibly accessed from a shared drive. Spark is appropriate when the input is a massive dataset and the computation fits the map-reduce paradigm.
With that out of the way, let's proceed with HTCondor.
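As a concrete sketch of the fan-out use case from the top of the post, here is a small shell script that generates one HTCondor submit description with one queued conversion job per wmv file. The folder name, placeholder files, and the /usr/bin/ffmpeg path are assumptions for illustration, not from the manual:

```shell
#!/bin/sh
# Sketch: fan out one ffmpeg conversion job per .wmv file via HTCondor.
# The demo_movies folder, its files, and /usr/bin/ffmpeg are assumptions.
mkdir -p demo_movies
touch demo_movies/a.wmv demo_movies/b.wmv   # placeholder inputs for the demo

{
  echo "universe = vanilla"
  echo "executable = /usr/bin/ffmpeg"
  echo "transfer_executable = false"
  for f in demo_movies/*.wmv; do
    base=$(basename "$f" .wmv)
    echo "arguments = \"-i $f demo_movies/$base.mp4\""
    echo "queue"
  done
} > convert.sub

echo "Generated convert.sub with $(grep -c '^queue$' convert.sub) jobs"
```

Submitting would then be a single condor_submit of convert.sub, letting the matchmaker spread the conversions across whatever slots are available.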
Machines Setup
I am currently using a cluster of 3 VMs, each running CentOS 6, with v8.6.0 of HTCondor; the manual is available here.
HTCondor Setup
Here are the commands that install HTCondor on each of these CentOS 6 systems, to be run as the root user. The original instructions can be found here.
cd /etc/yum.repos.d/
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
rpm --import RPM-GPG-KEY-HTCondor
yum install condor-all
Notes:
- CentOS <version> is equivalent to RHEL <version>
- You might have to clear the yum cache using 'yum clean all'
- Once the installation is successful, the binaries go into /usr/bin (/usr becomes the "installation folder"), and the config files go into /etc/condor
- As part of the installation process, a user and group named 'condor' are created.
Here are the corresponding commands that install HTCondor on Debian systems (I ended up setting this up on my VirtualBox VM for experimenting while on the go, where I run a Debian distro: Bunsen Labs Linux). Again, to be run as the root user. The original instructions can be found here.
echo "deb http://research.cs.wisc.edu/htcondor/debian/stable/ jessie contrib" >> /etc/apt/sources.list
wget -qO - http://research.cs.wisc.edu/htcondor/debian/HTCondor-Release.gpg.key | sudo apt-key add -
apt-get update
apt-get install -t jessie condor
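Once the package is installed on either distro, a quick sanity check is to run the following commands on the box and confirm they report sensible values:

```shell
condor_version                  # prints the installed HTCondor version
condor_config_val RELEASE_DIR   # prints the installation folder
```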
HTCondor Configuration
- Configuration settings live in condor_config (and condor_config.local) in /etc/condor
- One of the most important gotchas of configuring HTCondor is whether reverse DNS lookup works properly on your hosts, meaning 'nslookup <your ip address>' returns your full host name. If it does, you will save yourself a lot of trouble. Otherwise, you have to add additional configs as required.
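A quick way to check this on each host (the IP address below is a placeholder for the host's own address):

```shell
hostname -f              # prints the full host name
nslookup 192.168.1.10    # reverse lookup of this host's own IP
# The reverse lookup should come back with the same full host name.
```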
- I changed the following configurations (the first two because of the above reason):
- For all boxes in the grid, including CM (Central Manager):
- condor_config: ALLOW_WRITE = *
- For all boxes in the grid, including CM (Central Manager), if you want to use it for computations:
- condor_config.local: TRUST_UID_DOMAIN = true
- For boxes other than the CM, point to the correct CM IP address, or CM host name if reverse DNS is working.
- condor_config: CONDOR_HOST = x.x.x.x
- For Debian executor boxes, blank out the BASE_CGROUP variable. More notes here.
- condor_config.local: BASE_CGROUP =
- If you do not want the CM box to run jobs, do not run the STARTD daemon. More information here.
- condor_config.local: DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
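Putting these settings together, a central manager that should not run jobs, and a Debian executor box, might end up with condor_config.local files roughly like the following sketches (the CM address is a placeholder, and I am folding the ALLOW_WRITE setting into the local file for brevity):

```text
# condor_config.local on the CM (no jobs run here)
ALLOW_WRITE = *
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

# condor_config.local on a Debian executor box
ALLOW_WRITE = *
CONDOR_HOST = 192.168.1.10
TRUST_UID_DOMAIN = true
BASE_CGROUP =
```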
Running HTCondor on a Single Node
We will start with running HTCondor on a single node, which will eventually become our central manager.
- Start the condor daemons:
- On CentOS 6: /sbin/service condor start
- On Debian jessie: service condor start OR /etc/init.d/condor start
- Check if the correct processes are running.
[root@... condor]# ps -elf | egrep condor_
5 S condor 4200 1 0 80 0 - 11653 poll_s 14:53 ? 00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
4 S root 4241 4200 0 80 0 - 6379 poll_s 14:53 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 498
4 S condor 4242 4200 0 80 0 - 11513 poll_s 14:53 ? 00:00:00 condor_shared_port -f
4 S condor 4243 4200 0 80 0 - 16380 poll_s 14:53 ? 00:00:00 condor_collector -f
4 S condor 4244 4200 0 80 0 - 11639 poll_s 14:53 ? 00:00:00 condor_negotiator -f
4 S condor 4245 4200 0 80 0 - 16646 poll_s 14:53 ? 00:00:00 condor_schedd -f
4 S condor 4246 4200 0 80 0 - 11760 poll_s 14:53 ? 00:00:00 condor_startd -f
0 S root 4297 4098 0 80 0 - 25254 pipe_w 14:54 pts/0 00:00:00 egrep condor_
- Check if condor_status returns correctly, with an output similar to this.
[root@... condor]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTim
slot1@... LINUX X86_64 Unclaimed Idle 0.080 7975 0+00:00:0
slot2@... LINUX X86_64 Unclaimed Idle 0.000 7975 0+00:00:2
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 2 0 0 2 0 0 0
Total 2 0 0 2 0 0 0
Notes:
- If condor_status does not return anything, it is probably because reverse DNS lookup is not working on the system. Follow the details in this thread, and start by setting the ALLOW_WRITE variable in the condor_config file to *.
- My CentOS 6 VM is dual-core with 16GB of RAM, so HTCondor automatically breaks this up into 2 slots, each with 1 core/8GB RAM.
I experimented directly with a java universe job, but a vanilla universe job can be tried as well. Once the daemons are running and condor_status returns correctly, switch back to your user login and submit the job using 'condor_submit <job props file>'. If the job takes a bit of time to run, condor_status will show it.
[root@... condor]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTim
slot1@... LINUX X86_64 Claimed Busy 0.000 7975 0+00:00:0
slot2@... LINUX X86_64 Unclaimed Idle 0.000 7975 0+00:00:2
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 2 0 1 1 0 0 0
Total 2 0 1 1 0 0 0
The command condor_q would show it too.
[root@... condor]# condor_q
-- Schedd: ... : <x.x.x.x:42646> @ 02/24/17 14:23:43
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
samik.r CMD: java 2/24 07:38 _ 1 _ 1 2.0
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Troubleshooting/Notes:
- Because of the reverse DNS problem, when I initially submitted the job it was put on hold, and condor_status and condor_q were showing as much. Here are the steps I followed for debugging:
- I followed the instructions available here, specifically using the 'condor_q -analyze <job id>' command.
- I looked at the log file StarterLog.slot1
- Another time, I got a ClassNotFoundException, since the jar file was not available at the proper place. This error showed up in the job's output folder, in the error file (specified in the condor submission file).
Running HTCondor on multiple nodes
Once things work on a single node, it is easy to add more nodes to the cluster: just install the HTCondor package, change the configurations as described, and start the service.
Notes for running jobs
Some points that were not very obvious to me even after reading the instruction manual:
- In vanilla universe jobs, if the executable reads from standard input, use the input command; otherwise, use the arguments command. Consider the following snippet:
executable = testprog
input = input_file
For this snippet, condor tries to execute something like "testprog < input_file". If this is how testprog is supposed to run, great; otherwise, this will not work. On the other hand, if what you really want is "testprog input_file", then the following snippet should work:
executable = testprog
should_transfer_files = yes
# Define input file
my_input_file = input_file
transfer_input_files = $(my_input_file)
arguments = "<other args if required> $(my_input_file)"
Note that this possibly works a little bit differently in the java universe, where the input file name shows up as an argument to the executable (java).
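For reference, a java universe submission might look roughly like the sketch below. The class and file names are made up, and the exact conventions (the executable being the main class file, with the main class name as the first entry in arguments) should be checked against the manual's java universe chapter:

```text
universe = java
executable = Converter.class
arguments = Converter input_file
should_transfer_files = yes
transfer_input_files = input_file
queue
```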
- For executables that are supposed to be available at the exact same path on all nodes in the cluster, it is recommended to specify the full path and switch off transferring the executable:
executable = /path/to/executable
transfer_executable = false
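Tying these pieces together, a complete minimal vanilla universe submit file might look like the following sketch (the program and file names are illustrative):

```text
universe = vanilla
executable = /path/to/testprog
transfer_executable = false
should_transfer_files = yes
transfer_input_files = input_file
arguments = "input_file"
output = job.out
error = job.err
log = job.log
queue
```

Submitting is then a single condor_submit of this file, with progress visible in condor_q and condor_status as shown earlier.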
Update (March 9, 2017): Added Debian Jessie related commands, and formatting changes.
Update (April 15, 2017): Added more Debian notes.
Update (June 3, 2018): Added notes about CM.
Update (Dec 05, 2019): Added comparison between HTCondor and Spark.