Thrill  0.1
DuplicateDetection Class Reference

Detailed Description

Duplicate detection to identify all elements occuring only on one worker.

This information can be used to locally reduce uniquely-occuring elements. Therefore this saves communication volume in operations such as api::Reduce() or api::Join().

Internally, this duplicate detection uses a golomb encoded distributed single shot bloom filter to find duplicates and non-duplicates with as low communication volume as possible. Due to the bloom filter's inherent properties, this has false duplicates but no false non-duplicates.

Should only be used when a large amount of uniquely-occuring elements are expected.

Definition at line 46 of file duplicate_detection.hpp.

#include <duplicate_detection.hpp>

Public Member Functions

size_t FindNonDuplicates (std::vector< bool > &non_duplicates, std::vector< size_t > &hashes, Context &context, size_t dia_id)
 Identifies all hashes which occur on only a single worker. More...
 

Private Types

using GolombBitStreamReader = core::GolombBitStreamReader< data::CatStream::Reader >
 
using GolombBitStreamWriter = core::GolombBitStreamWriter< data::CatStream::Writer >
 
using GolumbDeltaReader = core::DeltaStreamReader< GolombBitStreamReader, size_t, 1 >
 
using GolumbDeltaWriter = core::DeltaStreamWriter< GolombBitStreamWriter, size_t, 1 >
 

Private Member Functions

void ReadEncodedHashesToVector (const data::CatStreamPtr &stream_pointer, std::vector< bool > &non_duplicates, size_t golomb_param)
 Reads a golomb encoded bitset from a data stream and returns it's contents in form of a vector of hashes. More...
 
void WriteEncodedHashes (const data::CatStreamPtr &stream_pointer, const std::vector< size_t > &hashes, size_t golomb_param, size_t num_workers, size_t max_hash)
 Sends all hashes in the range [max_hash / num_workers * p, max_hash / num_workers * (p + 1)) to worker p. More...
 

Static Private Attributes

static constexpr bool debug = false
 

Member Typedef Documentation

◆ GolombBitStreamReader

◆ GolombBitStreamWriter

◆ GolumbDeltaReader

Definition at line 61 of file duplicate_detection.hpp.

◆ GolumbDeltaWriter

Definition at line 58 of file duplicate_detection.hpp.

Member Function Documentation

◆ FindNonDuplicates()

size_t FindNonDuplicates ( std::vector< bool > &  non_duplicates,
std::vector< size_t > &  hashes,
Context context,
size_t  dia_id 
)
inline

Identifies all hashes which occur on only a single worker.

Returns all local uniques in form of a vector of hashes.

Parameters
non_duplicatesEmpty vector, which contains all non-duplicate hashes after this method
hashesHashes for all elements on this worker.
contextThrill context, used for collective communication
dia_idId of the operation, which calls this method. Used to uniquely identify the data streams used.
Returns
Modulo used on all hashes. (Use this modulo on all hashes to identify possible non-duplicates)

Definition at line 150 of file duplicate_detection.hpp.

References FlowControlChannel::AllReduce(), StreamData::Writers::Close(), Context::GetNewCatStream(), Context::net, Context::num_workers(), DuplicateDetection::ReadEncodedHashesToVector(), sLOG, and DuplicateDetection::WriteEncodedHashes().

Referenced by ReducePrePhase< TableItem, Key, Value, KeyExtractor, ReduceFunction, VolatileKey, BlockWriter, ReduceConfig, IndexFunction, EqualToFunction, HashFunction, true >::FlushAll().

◆ ReadEncodedHashesToVector()

void ReadEncodedHashesToVector ( const data::CatStreamPtr stream_pointer,
std::vector< bool > &  non_duplicates,
size_t  golomb_param 
)
inlineprivate

Reads a golomb encoded bitset from a data stream and returns it's contents in form of a vector of hashes.

Parameters
stream_pointerPointer to data stream
non_duplicatesTarget vector for hashes, should be empty beforehand
golomb_paramGolomb parameter

Definition at line 112 of file duplicate_detection.hpp.

References DeltaStreamReader< StreamReader, Type, offset_ >::HasNext(), and DeltaStreamReader< StreamReader, Type, offset_ >::Next().

Referenced by DuplicateDetection::FindNonDuplicates().

◆ WriteEncodedHashes()

void WriteEncodedHashes ( const data::CatStreamPtr stream_pointer,
const std::vector< size_t > &  hashes,
size_t  golomb_param,
size_t  num_workers,
size_t  max_hash 
)
inlineprivate

Sends all hashes in the range [max_hash / num_workers * p, max_hash / num_workers * (p + 1)) to worker p.

These hashes are encoded with a Golomb encoder in core.

Parameters
stream_pointerPointer to data stream
hashesSorted vector of all hashes modulo max_hash
golomb_paramGolomb parameter
num_workersNumber of workers in this Thrill process
max_hashModulo for all hashes

Definition at line 74 of file duplicate_detection.hpp.

References thrill::common::CalculateLocalRange(), Range::end, and DeltaStreamWriter< StreamWriter, Type, offset_ >::Put().

Referenced by DuplicateDetection::FindNonDuplicates().

Member Data Documentation

◆ debug

constexpr bool debug = false
staticprivate

Definition at line 48 of file duplicate_detection.hpp.


The documentation for this class was generated from the following file: