Apache DataSketches Library for Python
The Apache DataSketches Library for Python provides a collection of high-performance stochastic streaming algorithms (sketches) for answering approximate queries on massive datasets. The sketches come with mathematically proven error bounds and target problems such as count distinct, quantiles, most-frequent items, joins, matrix computations, and graph analysis. The current version is 5.2.0; releases follow the regular cadence of the Apache DataSketches project.
Warnings
- breaking: Version 5.0.0 introduced significant API changes, including the migration of the C++ bindings from pybind11 to nanobind. The API also became more 'pythonic': for example, sketches are copied with `.copy()` instead of C++-style copy constructors, and `str()` takes no arguments.
- gotcha: Python has no native unsigned integer type and no fixed-width numeric types narrower than 64 bits. As a result, for certain data types or configurations, a sketch created in Python may not be identical to one created with the Java or C++ versions of DataSketches.
- deprecated: The 'Quantiles Sketch' (e.g., `quantiles_ints_sketch`) is considered inferior to the KLL sketch and is officially deprecated in favor of the KLL and REQ sketches.
- gotcha: When integrating `datasketches` with Apache Spark, especially with Spark 3.5+ and Java 17+, the driver and executors require specific Spark configurations and Java options (`--add-modules=jdk.incubator.foreign`). Incorrect configuration can lead to runtime errors.
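The integer-width gotcha above can be illustrated with the standard library alone. The following is a minimal sketch and does not use the DataSketches API: it relies on `struct` to show that one Python integer maps to different byte layouts depending on the fixed-width type chosen, which is presumably the kind of mismatch the warning describes. The helper name `pack_as` is hypothetical.

```python
import struct


def pack_as(value: int, fmt: str) -> bytes:
    """Pack a Python int as little-endian bytes (hypothetical helper)."""
    return struct.pack("<" + fmt, value)


# Fits in an unsigned 32-bit integer, but not in a signed 32-bit one.
value = 3_000_000_000

as_u32 = pack_as(value, "I")  # unsigned 32-bit, 4 bytes
as_i64 = pack_as(value, "q")  # signed 64-bit, 8 bytes

try:
    pack_as(value, "i")       # signed 32-bit: value is out of range
    fits_i32 = True
except struct.error:
    fits_i32 = False

print(len(as_u32), len(as_i64), fits_i32)
```

Python itself stores `value` as an arbitrary-precision integer, so the width and signedness only become visible once the number crosses into a fixed-width representation.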
Install
pip install datasketches
Imports
- datasketches
import datasketches
- kll_ints_sketch
from datasketches import kll_ints_sketch
Quickstart
import datasketches
# Create a KLL sketch for integers
kll_sketch = datasketches.kll_ints_sketch()
# Update the sketch with data
for i in range(1000):
    kll_sketch.update(i)
# Query a quantile (value at a given rank) and a rank (normalized position of a value)
median = kll_sketch.get_quantile(0.5)
rank_99 = kll_sketch.get_rank(99)
print(f"Estimated median: {median}")
print(f"Estimated rank for value 99: {rank_99}")
# num_retained is the number of samples the sketch stores, not a distinct count
print(f"Number of items retained by the sketch: {kll_sketch.num_retained}")
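To help interpret the quickstart output, the exact values that the sketch approximates can be computed with the standard library. This is an illustrative baseline that does not use `datasketches`. It assumes that a rank is a normalized fraction in [0, 1] and that the queried value itself is counted (an inclusive rank); that convention is an assumption here, not something stated above. The function names `exact_rank` and `exact_quantile` are hypothetical.

```python
import bisect
import math

# Same stream as the quickstart, kept in sorted order.
sorted_data = sorted(range(1000))


def exact_rank(value: int) -> float:
    """Fraction of items less than or equal to value (inclusive rank)."""
    return bisect.bisect_right(sorted_data, value) / len(sorted_data)


def exact_quantile(rank: float) -> int:
    """Smallest item whose inclusive rank is at least the given rank."""
    index = max(0, math.ceil(rank * len(sorted_data)) - 1)
    return sorted_data[index]


print(f"Exact median: {exact_quantile(0.5)}")
print(f"Exact rank for value 99: {exact_rank(99)}")
```

The sketch's estimates should land close to these exact values, within its rank error bound.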