{"id":5604,"library":"datasketches","title":"Apache DataSketches Library for Python","description":"The Apache DataSketches Library for Python provides a collection of high-performance, stochastic streaming algorithms (sketches) for approximate queries on massive datasets. These sketches offer mathematically proven error bounds and are designed for problems like count distinct, quantiles, most-frequent items, joins, matrix computations, and graph analysis. The current version is 5.2.0, with a regular release cadence as part of the Apache DataSketches project.","status":"active","version":"5.2.0","language":"en","source_language":"en","source_url":"https://github.com/apache/datasketches-python","tags":["data sketches","streaming algorithms","approximate queries","big data","quantiles","distinct count","probabilistic data structures","high-performance computing"],"install":[{"cmd":"pip install datasketches","lang":"bash","label":"Install from PyPI"}],"dependencies":[{"reason":"Required for numerical operations and array handling.","package":"numpy","optional":false},{"reason":"Used for Python-C++ bindings. 
Replaced pybind11 in version 5.0.0 and later.","package":"nanobind","optional":false}],"imports":[{"note":"The primary module for accessing all sketch classes and utilities.","symbol":"datasketches","correct":"import datasketches"},{"note":"Specific sketch classes are exposed directly under the top-level 'datasketches' module.","wrong":"import kll_ints_sketch","symbol":"kll_ints_sketch","correct":"from datasketches import kll_ints_sketch"}],"quickstart":{"code":"import datasketches\n\n# Create a KLL sketch for integers\nkll_sketch = datasketches.kll_ints_sketch()\n\n# Update the sketch with data\nfor i in range(1000):\n    kll_sketch.update(i)\n\n# Query an approximate quantile and a normalized rank (a fraction in [0, 1])\nmedian = kll_sketch.get_quantile(0.5)\nrank_99 = kll_sketch.get_rank(99)\n\nprint(f\"Estimated median: {median}\")\nprint(f\"Estimated normalized rank for value 99: {rank_99}\")\n# num_retained is the number of samples the sketch stores, not a distinct count;\n# it is assumed here to be a read-only property, as in the 5.x API.\nprint(f\"Number of items retained by the sketch: {kll_sketch.num_retained}\")\n","lang":"python","description":"This quickstart demonstrates how to create and use a KLL integer sketch (named after its authors Karnin, Lang, and Liberty) to estimate quantiles and ranks from a stream of data. The KLL sketch provides approximate quantile information with provable rank-error guarantees."},"warnings":[{"fix":"Review your code for C++-style copy constructors and `str()` calls. Adapt to Pythonic `obj.copy()` methods and argument-less `str()` for object representation. Ensure `nanobind` is installed instead of `pybind11`.","message":"Version 5.0.0 introduced significant API changes, including the migration from pybind11 to nanobind for C++ bindings. This also led to more 'pythonic' API patterns, such as using `.copy()` instead of C++-style copy constructors and `str()` taking no arguments.","severity":"breaking","affected_versions":"5.0.0 and later"},{"fix":"Be aware of potential discrepancies when comparing sketch results or binary serializations across different language implementations. 
Loading sketches serialized from other languages into Python will work as expected, but the creation process may differ.","message":"Python's native integer types do not support unsigned integers or numeric values with fewer than 64 bits directly. This can result in sketches created within Python being non-identical to those created in Java or C++ versions of DataSketches for certain data types or configurations.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Migrate existing code using 'Quantiles Sketch' to `kll_ints_sketch`, `kll_floats_sketch`, or `req_ints_sketch`, `req_floats_sketch` for better accuracy and performance.","message":"The 'Quantiles Sketch' (e.g., `quantiles_ints_sketch`) is considered an inferior algorithm compared to the KLL sketch and is officially deprecated in favor of KLL and REQ sketches.","severity":"deprecated","affected_versions":"3.4.0 and later"},{"fix":"Consult the `datasketches-spark` documentation for detailed Spark configuration settings, including `spark.driver.userClassPathFirst`, `spark.executor.userClassPathFirst`, and the necessary Java options for module exports.","message":"When integrating `datasketches` with Apache Spark, especially with Spark 3.5+ and Java 17+, specific Spark configurations and Java options (`--add-modules=jdk.incubator.foreign`) are required for the driver and executors. Incorrect configuration can lead to runtime errors.","severity":"gotcha","affected_versions":"All versions (specific to Spark 3.5+ / Java 17+)"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}