Apache DataSketches Library for Python

5.2.0 · active · verified Mon Apr 13

The Apache DataSketches Library for Python provides a collection of high-performance, stochastic streaming algorithms (sketches) for approximate queries on massive datasets. These sketches offer mathematically proven error bounds and are designed for problems like count distinct, quantiles, most-frequent items, joins, matrix computations, and graph analysis. The current version is 5.2.0, with a regular release cadence as part of the Apache DataSketches project.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to create and use a KLL integer sketch to estimate quantiles and ranks from a stream of data. KLL is named after its authors (Karnin, Lang, and Liberty). The sketch keeps only a small sample of the stream, yet answers approximate quantile and rank queries with strong error guarantees.

import datasketches

# Create a KLL sketch for integers
kll_sketch = datasketches.kll_ints_sketch()

# Update the sketch with data
for i in range(1000):
    kll_sketch.update(i)

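# Query an approximate quantile (the value at normalized rank 0.5)
median = kll_sketch.get_quantile(0.5)
# Query the approximate normalized rank (a fraction in [0, 1]) of the value 99
rank_99 = kll_sketch.get_rank(99)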

print(f"Estimated median: {median}")
print(f"Estimated rank for value 99: {rank_99}")
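# Not a distinct count: this is how many items the sketch currently stores.
# (Some releases may expose this as the `num_retained` property instead.)
print(f"Items retained by the sketch: {kll_sketch.get_num_retained()}")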
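
To judge the sketch's answers, it helps to know the exact values it is approximating. The following is a minimal sketch using only the Python standard library. The helpers `exact_rank` and `exact_quantile` are hypothetical names introduced for illustration and are not part of datasketches. They assume a normalized rank is the fraction of items at or below a value (inclusive) or strictly below it (exclusive); which convention a given release uses by default is not stated here.

```python
import bisect
import math


def exact_rank(sorted_values, value, inclusive=True):
    """Exact normalized rank: fraction of items <= value (or < value)."""
    if inclusive:
        count = bisect.bisect_right(sorted_values, value)
    else:
        count = bisect.bisect_left(sorted_values, value)
    return count / len(sorted_values)


def exact_quantile(sorted_values, rank):
    """Smallest item whose inclusive normalized rank is at least `rank`."""
    n = len(sorted_values)
    index = max(math.ceil(rank * n) - 1, 0)
    return sorted_values[min(index, n - 1)]


# Same stream as the quickstart: the integers 0..999
data = sorted(range(1000))

print(exact_rank(data, 99))                   # 0.1
print(exact_rank(data, 99, inclusive=False))  # 0.099
print(exact_quantile(data, 0.5))              # 499
```

The sketch's estimates for the same stream should land close to these values, within its rank error bound. Unlike this baseline, the sketch does not need to hold or sort the full stream.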
