About Thrill

Thrill is a C++ framework for distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing.

We last presented our ongoing work on Thrill at the IEEE Conference on Big Data in December 2016. A longer technical report about the design and goals is available at arXiv: https://arxiv.org/abs/1608.05634. This paper gives a good introduction into the concepts and ideas. The slides of our presentation at the conference are also available and give a visual introduction.

The development code is available on github under a BSD open-source license and outside contributors are welcome to join and contact us.

Doxygen documentation automatically built from the master is available. The doxygen documentation also contains a tutorial “ Write your first Thrill program”.

GitHub:

Travis-CI:

Some of the main goals for the design are:

To create a high-performance Big Data batch processing framework.
Expose a powerful C++ user interface, that is efficiently tied to the framework’s internals. The interface supports the Map/Reduce paradigm, but also versatile “dataflow graph” style computations like Apache Spark or Apache Flink with host language control flow.
See our examples of WordCount, PageRank, and k-Means.
Leverage newest C++11 and C++14 features like lambda functions and auto types to make writing user programs easy and convenient.
Enable compilation of binary programs with full compile-time optimization runnable directly on hardware without a virtual machine interpreter. Exploit cache effects due to less indirections than in Java and other languages. Save energy and money by reducing computation overhead.
Due to the zero-overhead concept of C++, enable applications to process small datatypes efficiently with no overhead.
Support external memory well by implementing I/O-efficient algorithms where needed, but keep most computations in RAM.
Perform full pipelining of data flows, where pipelined stages are often combined at compile time.
Avoid all unnecessary round trips of data to memory or disk.
Enable reproducible benchmarking of programs due to RAII memory management.

In the long term the framework can play a mediator role between Big Data applications and lower layer algorithms research, which may include:

Research into more communication efficient distributed algorithms for basic operations like sorting, selection, hashing, etc.
Support fault tolerant execution with lower overheads due to fault-resilient algorithms and better checkpointing.
Join Big Data research with succinct data structures and compression to enable more computations to be performed in RAM.

Current Authors and Contributors:

Michael Axtmann, Timo Bingmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Matthias Stumpp, Peter Sanders, Sebastian Schlag, Tobias Sturm.

Weblog Posts

Sep 26, 2015

Word Count Example

This C++ snippet shows our (unoptimized) working example of Word Count in Thrill.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
using namespace thrill;

size_t WordCountExample(Context& ctx) {

    auto lines = ReadLines(ctx, "wordcount.in");

    auto word_pairs = lines.template FlatMap<WordCountPair>(
        [](const std::string& line, auto emit) -> void {
                /* map lambda: emit each word */
            for (const std::string& word : common::split(line, ' ')) {
                if (word.size() != 0)
                    emit(WordCountPair(word, 1));
            }
        });

    auto red_words =  word_pairs.ReduceBy(
        [](const WordCountPair& in) -> std::string {
            /* reduction key: the word string */
            return in.first;
        },
        [](const WordCountPair& a, const WordCountPair& b) -> WordCountPair {
            /* associative reduction operator: add counters */
            return WordCountPair(a.first, a.second + b.second);
        });

    red_words.Map(
        [](const WordCountPair& wc) {
            return wc.first + ": " + std::to_string(wc.second);
        })
    .WriteLinesMany(
        "wordcount_" + std::to_string(ctx.my_rank()) + ".out");

    return 0;
}

int main(int argc, char* argv[]) {
    return api::Run(WordCountExample);
}