Thrill
0.1
|
Duplicate detection to identify all elements occuring only on one worker.
This information can be used to locally reduce uniquely-occuring elements. Therefore this saves communication volume in operations such as api::Reduce() or api::Join().
Internally, this duplicate detection uses a golomb encoded distributed single shot bloom filter to find duplicates and non-duplicates with as low communication volume as possible. Due to the bloom filter's inherent properties, this has false duplicates but no false non-duplicates.
Should only be used when a large amount of uniquely-occuring elements are expected.
Definition at line 46 of file duplicate_detection.hpp.
#include <duplicate_detection.hpp>
Public Member Functions | |
size_t | FindNonDuplicates (std::vector< bool > &non_duplicates, std::vector< size_t > &hashes, Context &context, size_t dia_id) |
Identifies all hashes which occur on only a single worker. More... | |
Private Types | |
using | GolombBitStreamReader = core::GolombBitStreamReader< data::CatStream::Reader > |
using | GolombBitStreamWriter = core::GolombBitStreamWriter< data::CatStream::Writer > |
using | GolumbDeltaReader = core::DeltaStreamReader< GolombBitStreamReader, size_t, 1 > |
using | GolumbDeltaWriter = core::DeltaStreamWriter< GolombBitStreamWriter, size_t, 1 > |
Private Member Functions | |
void | ReadEncodedHashesToVector (const data::CatStreamPtr &stream_pointer, std::vector< bool > &non_duplicates, size_t golomb_param) |
Reads a golomb encoded bitset from a data stream and returns it's contents in form of a vector of hashes. More... | |
void | WriteEncodedHashes (const data::CatStreamPtr &stream_pointer, const std::vector< size_t > &hashes, size_t golomb_param, size_t num_workers, size_t max_hash) |
Sends all hashes in the range [max_hash / num_workers * p, max_hash / num_workers * (p + 1)) to worker p. More... | |
Static Private Attributes | |
static constexpr bool | debug = false |
|
private |
Definition at line 55 of file duplicate_detection.hpp.
|
private |
Definition at line 52 of file duplicate_detection.hpp.
|
private |
Definition at line 61 of file duplicate_detection.hpp.
|
private |
Definition at line 58 of file duplicate_detection.hpp.
|
inline |
Identifies all hashes which occur on only a single worker.
Returns all local uniques in form of a vector of hashes.
non_duplicates | Empty vector, which contains all non-duplicate hashes after this method |
hashes | Hashes for all elements on this worker. |
context | Thrill context, used for collective communication |
dia_id | Id of the operation, which calls this method. Used to uniquely identify the data streams used. |
Definition at line 150 of file duplicate_detection.hpp.
References FlowControlChannel::AllReduce(), StreamData::Writers::Close(), Context::GetNewCatStream(), Context::net, Context::num_workers(), DuplicateDetection::ReadEncodedHashesToVector(), sLOG, and DuplicateDetection::WriteEncodedHashes().
|
inlineprivate |
Reads a golomb encoded bitset from a data stream and returns it's contents in form of a vector of hashes.
stream_pointer | Pointer to data stream |
non_duplicates | Target vector for hashes, should be empty beforehand |
golomb_param | Golomb parameter |
Definition at line 112 of file duplicate_detection.hpp.
References DeltaStreamReader< StreamReader, Type, offset_ >::HasNext(), and DeltaStreamReader< StreamReader, Type, offset_ >::Next().
Referenced by DuplicateDetection::FindNonDuplicates().
|
inlineprivate |
Sends all hashes in the range [max_hash / num_workers * p, max_hash / num_workers * (p + 1)) to worker p.
These hashes are encoded with a Golomb encoder in core.
stream_pointer | Pointer to data stream |
hashes | Sorted vector of all hashes modulo max_hash |
golomb_param | Golomb parameter |
num_workers | Number of workers in this Thrill process |
max_hash | Modulo for all hashes |
Definition at line 74 of file duplicate_detection.hpp.
References thrill::common::CalculateLocalRange(), Range::end, and DeltaStreamWriter< StreamWriter, Type, offset_ >::Put().
Referenced by DuplicateDetection::FindNonDuplicates().
|
staticprivate |
Definition at line 48 of file duplicate_detection.hpp.