Thrill  0.1
Running Thrill programs on a cluster

Execution Model

Thrill programs are binary executables which run simultaneously on h hosts in a cluster with equal hardware configuration. Generally, they use all cores and RAM available on the host and communicate via a network protocol. If external disk memory is used, this should be on local disks.

[Figure: cluster-execution.svg]

Input and output are usually stored on a fault-tolerant distributed file system such as Ceph, Lustre, NFS, or HDFS. Currently, Thrill supports only the standard (POSIX) file system interface; support for HDFS's proprietary interface may be added some day.

In general, the startup procedure aims to launch the same binary on all hosts. On startup, each Thrill binary then connects to the other running instances via the network. Hence, the host, port, and other connection parameters of the other instances must be passed to it.

We provide a set of run scripts in the source package that automatically launch Thrill programs via SSH, on a set of AWS EC2 instances, and on clusters scheduled by Slurm. MPI is also supported natively.

When started with no parameters, Thrill programs run locally and simulate a two-host network, with all cores available on the machine divided between the two virtual hosts. Communication is performed via local kernel sockets on Linux.
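The core split described above can be sketched with a little shell arithmetic. This only illustrates the description; it is not taken from Thrill's source. The helper name workers_per_virtual_host is hypothetical, and rounding down for odd core counts is an assumption.

```shell
# Hypothetical helper: split a core count evenly across virtual hosts.
# Integer division rounds down; how Thrill treats a remainder is not
# stated in this document, so treat that detail as an assumption.
workers_per_virtual_host() {
    echo $(( $1 / $2 ))
}

# Number of cores on this machine (nproc on Linux, getconf as a fallback).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

echo "cores=$cores virtual_hosts=2 workers_per_host=$(workers_per_virtual_host "$cores" 2)"
```

On an 8-core machine, this would report 4 workers for each of the two virtual hosts.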

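Thrill's configuration is passed via standard Unix environment variables. Because environment variables are easy to hand on when a process is launched on another machine, this makes it simple to carry the configuration across the hosts of a cluster. See Environment Configuration Variables for more configuration options.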
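As a minimal sketch of how such settings reach a program, the snippet below passes two variables to a single child process. The names THRILL_LOCAL and THRILL_WORKERS_PER_HOST are assumptions here and should be checked against Environment Configuration Variables. The child is a stand-in `sh -c` command rather than a real Thrill binary.

```shell
# Stand-in for a Thrill binary: a child process that prints the settings it
# inherits. The variable names are assumptions, see the lead-in above.
child='echo "local=${THRILL_LOCAL:-unset} workers=${THRILL_WORKERS_PER_HOST:-unset}"'

# Per-command assignment: only this one child process sees the values.
one_shot=$(THRILL_LOCAL=4 THRILL_WORKERS_PER_HOST=2 sh -c "$child")
echo "$one_shot"    # prints: local=4 workers=2

# Without the prefix, a child sees only what the calling shell has exported.
sh -c "$child"
```

Prefix assignments like these keep the configuration attached to one invocation, whereas `export` would apply it to every later command in the same shell.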

Running via SSH

The bash script run/ssh/invoke.sh launches the same binary on a set of hosts via ssh. These hosts should be set up with ssh keys for authentication.

For example, the following will launch the WordCount example on two hosts called jupiter and saturn. We assume Thrill is in directory ~/thrill.

$ cd ~/thrill/build/examples/word_count/
$ ~/thrill/run/ssh/invoke.sh -h "jupiter saturn" word_count_run --generate 1000 --output result-

This ssh invocation assumes a common distributed file system on all hosts, including the launcher. This is a common setup in Linux clusters with an NFS server. In principle, the script accesses each machine via ssh and launches the word_count_run executable with the additional connection parameters. All arguments after word_count_run are passed on to the program.
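To make the idea concrete, here is a rough dry-run sketch of such a launcher. It is not the content of run/ssh/invoke.sh: the function name print_launch_commands is hypothetical, port 10000 is an arbitrary choice, and the variable names THRILL_HOSTLIST and THRILL_RANK are assumptions to be checked against Environment Configuration Variables. The function only prints the ssh commands and does not execute them.

```shell
# Hypothetical dry-run launcher: print one ssh command per host.
# Every instance receives the full host list and its own rank (from 0).
print_launch_commands() {
    hosts=$1        # space-separated host names, e.g. "jupiter saturn"
    shift           # the remaining arguments are the program and its options
    port=10000      # arbitrary port chosen for this sketch

    # Build the "host:port host:port ..." list shared by all instances.
    hostlist=""
    for h in $hosts; do
        hostlist="${hostlist:+$hostlist }$h:$port"
    done

    # Emit one command line per host, with an increasing rank.
    rank=0
    for h in $hosts; do
        echo "ssh $h THRILL_HOSTLIST='$hostlist' THRILL_RANK=$rank $*"
        rank=$((rank + 1))
    done
}

print_launch_commands "jupiter saturn" ./word_count_run --generate 1000 --output result-
```

Printing the commands first is a convenient way to inspect the connection parameters before anything is started on the remote machines.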

One can also launch Thrill programs on a "remote" cluster which does not contain the executable. This is done by simply adding the -c option. The script then copies the executable via ssh to the remote hosts and runs it from /tmp.

Running via MPI

The simplest method to run Thrill programs in a cluster is using MPI. If compiled with an MPI library like OpenMPI, the Thrill program will detect that it is running under MPI and use MPI calls to communicate. No configuration is needed.

However, we have had many problems with multi-threading and MPI libraries.
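As a hedged illustration of an MPI launch, the sketch below assembles an Open MPI style mpirun command line and prints it without running it. It assumes one MPI process per host, so that the threads inside each process can use the cores, which this document does not state explicitly. The helper name print_mpi_command is hypothetical, and other MPI launchers may use different options than -np and --host.

```shell
# Hypothetical helper: print an Open MPI style command line that starts
# one process per host (an assumption of this sketch).
print_mpi_command() {
    hosts=$1        # space-separated host names
    shift           # the remaining arguments are the program and its options

    n=0
    list=""
    for h in $hosts; do
        list="${list:+$list,}$h"    # mpirun expects a comma-separated list
        n=$((n + 1))
    done

    echo "mpirun -np $n --host $list $*"
}

print_mpi_command "jupiter saturn" ./word_count_run --generate 1000 --output result-
# prints: mpirun -np 2 --host jupiter,saturn ./word_count_run --generate 1000 --output result-
```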

Running on AWS EC2

We often run Thrill programs on a set of bare-bones EC2 instances running on AWS. In principle, these are accessed via SSH, and the Thrill program is uploaded from the development machine.

The process of looking up the currently running EC2 instances, reading their public and private IPs, etc., is automated by the scripts in run/ec2/. The Python scripts require the boto3 library, which is available via pip.

First run ~/thrill/run/ec2/make_env.py and check the output for running EC2 instances.

Then run ~/thrill/run/ec2/invoke.sh, which calls the SSH invoke script (run/ssh/invoke.sh) with all EC2 instances as hosts.