The final step in this tutorial is to enable reading and writing of files instead of generating random points.

In Thrill, line-based text files are easily read using ReadLines(). This DIA operation creates a DIA<std::string>, which can be parsed further. The following function performs such an operation and parses each line as "<x> <y>" using std::istringstream into our Point struct.

thrill::DIA<Point> LoadPoints(thrill::Context& ctx, const char* path) {
    // load points from text file
    auto points =
        ReadLines(ctx, path)
        .Map(
            [](const std::string& input) {
                // parse "<x> <y>" lines
                std::istringstream iss(input);
                Point p;
                iss >> p.x >> p.y;
                // reject failed extraction as well as trailing characters
                if (!iss || iss.peek() != EOF)
                    die("Could not parse point coordinates: " << input);
                return p;
            });
    return points.Cache();
}
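
The parsing step can also be tried in isolation, without Thrill. The following is a minimal sketch, assuming Point is a struct with two double fields x and y; ParsePointLine is a hypothetical helper and not part of the tutorial's code. It reports failure through a return value instead of calling die(), and it tolerates trailing whitespace.

```cpp
#include <sstream>
#include <string>

// Stand-in for the tutorial's Point struct (assumed: two double fields).
struct Point {
    double x, y;
};

// Hypothetical helper: parse one "<x> <y>" line into *out.
// Returns false for malformed lines instead of aborting.
bool ParsePointLine(const std::string& line, Point* out) {
    std::istringstream iss(line);
    Point p;
    // extraction fails on non-numeric or missing fields
    if (!(iss >> p.x >> p.y))
        return false;
    // skip trailing whitespace, then require the end of the line
    iss >> std::ws;
    if (iss.peek() != std::istringstream::traits_type::eof())
        return false;
    *out = p;
    return true;
}
```

A lambda passed to Map() could call such a helper and decide itself whether a malformed line should abort the program or be handled differently.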

LoadPoints returns a DIA<Point>, so we need to refactor the random point generator into a similar function.

thrill::DIA<Point> GeneratePoints(thrill::Context& ctx) {
    std::default_random_engine rng(std::random_device { } ());
    std::uniform_real_distribution<double> dist(0.0, 1000.0);
    // generate 100 random points using uniform distribution
    auto points =
        Generate(
            ctx, /* size */ 100,
            [&](const size_t&) {
                return Point { dist(rng), dist(rng) };
            });
    // Execute() is required due to lazy evaluation
    return points.Cache().Execute();
}

Interestingly, we have to add an Execute() to explicitly generate the cached DIA before returning from the function. Otherwise, the random generator objects, which the lambda captures by reference, are destroyed while still being used by the lambda function. This is one of the pitfalls of lazy DIA operation evaluation.
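
The lifetime issue is not specific to Thrill and can be sketched in plain C++. In the following analogy, LazySeq, GenerateEager and GenerateOwning are hypothetical names introduced only for illustration; LazySeq is not how a DIA is implemented. The first function evaluates the lambda while the generator objects are still alive, which mirrors what Execute() achieves above. The second lets the lambda own copies of the generator state; whether that variant carries over to Thrill's Generate() is not verified here.

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// A "lazy" sequence: nothing is computed until Materialize() is called.
// This is only an analogy for a DIA, not Thrill's implementation.
struct LazySeq {
    std::size_t size;
    std::function<double(std::size_t)> gen;

    std::vector<double> Materialize() const {
        std::vector<double> out;
        out.reserve(size);
        for (std::size_t i = 0; i < size; ++i)
            out.push_back(gen(i));
        return out;
    }
};

// Variant 1: evaluate while rng and dist are still in scope.
std::vector<double> GenerateEager(std::size_t n, unsigned seed) {
    std::default_random_engine rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1000.0);
    LazySeq seq { n, [&](std::size_t) { return dist(rng); } };
    // the lambda runs here, before rng and dist are destroyed
    return seq.Materialize();
}

// Variant 2: capture copies, so the lambda owns its generator state and
// may safely be evaluated after this function has returned.
LazySeq GenerateOwning(std::size_t n, unsigned seed) {
    std::default_random_engine rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1000.0);
    return LazySeq {
        n, [rng, dist](std::size_t) mutable { return dist(rng); }
    };
}
```

Returning a LazySeq whose lambda captured rng and dist by reference would be the broken case: the references would dangle as soon as the function returned.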

With LoadPoints() and GeneratePoints(), we only have to add a DIA<Point> parameter to Process().

//! our main processing method
void Process(const thrill::DIA<Point>& points, const char* output) {

To make the output configurable, we also add an output parameter. Line-based text files can be written in Thrill using WriteLines(), which requires a DIA<std::string>. So we have to map the Point objects to std::string objects before calling the write operation.

    if (output) {
        // write output as "x y" lines
        centers
        .Map([](const Point& p) {
                 return std::to_string(p.x) + " " + std::to_string(p.y);
             })
        .WriteLines(output);
    }
    else {
        centers.Print("final centers");
    }
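
One detail of this mapping is worth knowing: in the C++ standards up to C++23, std::to_string(double) formats as if by printf's %f, that is, with six digits after the decimal point, so written coordinates may lose precision. If the output is meant to be read back unchanged, a stream with max_digits10 significant digits is one alternative. The following is a sketch only; FormatPointExact is a hypothetical name, and Point is assumed to be a struct with two double fields.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Stand-in for the tutorial's Point struct (assumed: two double fields).
struct Point {
    double x, y;
};

// Hypothetical alternative to std::to_string(): format as "x y" with
// enough significant digits for each double to survive a write/read
// round trip.
std::string FormatPointExact(const Point& p) {
    std::ostringstream oss;
    oss << std::setprecision(std::numeric_limits<double>::max_digits10)
        << p.x << ' ' << p.y;
    return oss.str();
}
```

Such a function could be used as the body of the lambda passed to Map() before WriteLines(). The trade-off is longer lines in the output file for values that have no short decimal representation.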

The only remaining thing to do is to pass the command line parameters to Process(). This is a very simplistic way to process the command line; see other examples in Thrill's source for a more elaborate command line parser.

int main(int argc, char* argv[]) {
    // launch Thrill program: the lambda function will be run on each worker.
    return thrill::Run(
        [&](thrill::Context& ctx) {
            if (argc == 1)
                Process(GeneratePoints(ctx), nullptr);
            else if (argc == 2)
                Process(LoadPoints(ctx, argv[1]), nullptr);
            else if (argc == 3)
                Process(LoadPoints(ctx, argv[1]), argv[2]);
            else
                std::cerr << "Usage: " << argv[0]
                          << " [points] [output]" << std::endl;
        });
}
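
The argument handling can also be separated from the Thrill launch, which makes it easy to check on its own. This is a minimal sketch under assumptions: Options and ParseOptions are hypothetical names, not part of the tutorial or of Thrill.

```cpp
// Hypothetical result type, not part of the tutorial's code.
struct Options {
    bool valid;          // false if too many arguments were given
    const char* input;   // nullptr: generate random points
    const char* output;  // nullptr: print the centers instead of writing
};

// Interpret "[points] [output]" in the same way as the if/else chain.
Options ParseOptions(int argc, char* argv[]) {
    if (argc > 3)
        return Options { false, nullptr, nullptr };
    return Options {
        true,
        argc >= 2 ? argv[1] : nullptr,
        argc >= 3 ? argv[2] : nullptr
    };
}
```

With such a helper, main() could print the usage message and return a nonzero exit code before launching any workers when valid is false, and otherwise select GeneratePoints() or LoadPoints() inside the lambda depending on whether input is nullptr.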

See the complete example code in examples/tutorial/k-means_step5.cpp.

The source package contains a file k-means_points.txt as an example input.

Next Steps

Author: Timo Bingmann (2016)