About Thrill
Thrill is a C++ framework for distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing.
We last presented our ongoing work on Thrill at the IEEE Conference on Big Data in December 2016. A longer technical report about the design and goals is available at arXiv: https://arxiv.org/abs/1608.05634. This paper gives a good introduction into the concepts and ideas. The slides of our presentation at the conference are also available and give a visual introduction.
The development code is available on github under a BSD open-source license and outside contributors are welcome to join and contact us.
Doxygen documentation automatically built from the master is available. The doxygen documentation also contains a tutorial “ Write your first Thrill program”.
GitHub: | Travis-CI: |
Some of the main goals for the design are:
-
To create a high-performance Big Data batch processing framework.
-
Expose a powerful C++ user interface, that is efficiently tied to the framework’s internals. The interface supports the Map/Reduce paradigm, but also versatile “dataflow graph” style computations like Apache Spark or Apache Flink with host language control flow.
See our examples of WordCount, PageRank, and k-Means. -
Leverage newest C++11 and C++14 features like lambda functions and auto types to make writing user programs easy and convenient.
-
Enable compilation of binary programs with full compile-time optimization runnable directly on hardware without a virtual machine interpreter. Exploit cache effects due to less indirections than in Java and other languages. Save energy and money by reducing computation overhead.
-
Due to the zero-overhead concept of C++, enable applications to process small datatypes efficiently with no overhead.
-
Support external memory well by implementing I/O-efficient algorithms where needed, but keep most computations in RAM.
-
Perform full pipelining of data flows, where pipelined stages are often combined at compile time.
-
Avoid all unnecessary round trips of data to memory or disk.
-
Enable reproducible benchmarking of programs due to RAII memory management.
In the long term the framework can play a mediator role between Big Data applications and lower layer algorithms research, which may include:
-
Research into more communication efficient distributed algorithms for basic operations like sorting, selection, hashing, etc.
-
Support fault tolerant execution with lower overheads due to fault-resilient algorithms and better checkpointing.
-
Join Big Data research with succinct data structures and compression to enable more computations to be performed in RAM.
Current Authors and Contributors:
Michael Axtmann, Timo Bingmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Matthias Stumpp, Peter Sanders, Sebastian Schlag, Tobias Sturm.
Weblog Posts
-
Word Count Example
This C++ snippet shows our (unoptimized) working example of Word Count in Thrill.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
using namespace thrill; size_t WordCountExample(Context& ctx) { auto lines = ReadLines(ctx, "wordcount.in"); auto word_pairs = lines.template FlatMap<WordCountPair>( [](const std::string& line, auto emit) -> void { /* map lambda: emit each word */ for (const std::string& word : common::split(line, ' ')) { if (word.size() != 0) emit(WordCountPair(word, 1)); } }); auto red_words = word_pairs.ReduceBy( [](const WordCountPair& in) -> std::string { /* reduction key: the word string */ return in.first; }, [](const WordCountPair& a, const WordCountPair& b) -> WordCountPair { /* associative reduction operator: add counters */ return WordCountPair(a.first, a.second + b.second); }); red_words.Map( [](const WordCountPair& wc) { return wc.first + ": " + std::to_string(wc.second); }) .WriteLinesMany( "wordcount_" + std::to_string(ctx.my_rank()) + ".out"); return 0; } int main(int argc, char* argv[]) { return api::Run(WordCountExample); }