T-Digest data structure
The `tdigest` library is a Python implementation of Ted Dunning's t-digest data structure, designed for efficient and accurate percentile and quantile estimation from streaming or distributed data. Besides percentiles and quantiles, it can compute trimmed means. The current official PyPI version is 0.5.2.2; releases focus on performance improvements and bug fixes. The library is maintained, with occasional updates.
Common errors
-
AttributeError: 'TDigest' object has no attribute 'quantile'
cause Attempting to call the `quantile` method on a `TDigest` object in version 0.5.0 or later, after it was renamed.
fix Use the `cdf` method instead: `digest.cdf(x)`.
-
Download error on https://pypi.org/simple/cython/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION]
cause This error can occur when an outdated `pip` or Python environment (often Python 2 or an older Python 3.x) tries to reach PyPI, which requires a modern SSL/TLS configuration. This specific error was reported when installing `tdigest`, which indirectly pulls in `cython`.
fix Upgrade `pip` to the latest version (`pip install --upgrade pip`) and ensure your Python environment supports modern TLS protocols. Consider upgrading Python itself if it is very old.
-
TypeError: unsupported operand type(s) for +: 'TDigest' and 'NoneType'
cause This typically happens when trying to merge a `TDigest` object with a variable that is `None`, often because a digest was never initialized or a previous operation failed.
fix Ensure every operand of a merge with the `+` operator is an initialized `TDigest` object. Where no data is available yet, use an empty `TDigest()` as the placeholder instead of `None`.
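The first and third errors above can be guarded against with two small helpers. This is a minimal sketch, not part of the library: `cdf_compat` and `merge_digests` are hypothetical names, and the `quantile` fallback assumes, as stated above, that releases before 0.5.0 exposed the same computation under that name.

```python
from functools import reduce


def cdf_compat(digest, x):
    """Return the estimated fraction of values <= x.

    Assumption: newer releases expose `cdf`, older ones expose the
    same computation as `quantile`. Whichever exists is called.
    """
    method = getattr(digest, "cdf", None) or getattr(digest, "quantile", None)
    if method is None:
        raise AttributeError("digest exposes neither cdf() nor quantile()")
    return method(x)


def merge_digests(digests, make_empty):
    """Sum digests with `+`, skipping any entries that are None.

    `make_empty` is a zero-argument factory for an empty digest,
    for example the TDigest class itself.
    """
    present = [d for d in digests if d is not None]
    # Start from an empty digest so an all-None input still returns a digest.
    return reduce(lambda left, right: left + right, present, make_empty())
```

With the library installed, a call would look like `merge_digests(parts, TDigest)`, where `parts` is a hypothetical list of per-shard digests that may contain `None`. Starting the reduction from an empty digest means the result is always a digest object, never `None`.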
Warnings
- breaking The `quantile` method was renamed to `cdf` in version 0.5.0 for more accurate terminology. Calling `quantile()` on versions 0.5.0 and later will result in an AttributeError.
- gotcha The latest PyPI version (0.5.2.2) lags behind the latest GitHub release (v0.6.0.1). Users seeking the absolute latest features or bug fixes might need to install directly from GitHub, though this is not typically recommended for production.
- gotcha Older documentation and PyPI metadata mention compatibility with Python 2. Given Python 2's end-of-life, new development and most recent versions of `tdigest` are exclusively targeting Python 3. Relying on Python 2 compatibility is strongly discouraged and likely to lead to issues.
- gotcha `TDigest` objects can be serialized to and from Python dictionaries using `to_dict()` and `update_from_dict()`. Direct pickling, by contrast, might not always be forward/backward compatible across minor versions, because the internal data structure can change (for example, `accumulation_tree` in v0.5.0).
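The dictionary round trip mentioned in the last warning can be wrapped for JSON storage roughly as follows. This is a sketch under assumptions: `digest_to_json` and `digest_from_json` are hypothetical helper names, and it assumes the dictionary returned by `to_dict()` contains only JSON-serializable values.

```python
import json


def digest_to_json(digest):
    """Serialize a digest's state, as returned by to_dict(), to a JSON string."""
    return json.dumps(digest.to_dict())


def digest_from_json(text, make_empty):
    """Rebuild a digest from a string produced by digest_to_json().

    `make_empty` is a zero-argument factory for an empty digest,
    for example the TDigest class itself.
    """
    digest = make_empty()
    digest.update_from_dict(json.loads(text))
    return digest
```

With the library installed, `digest_from_json(digest_to_json(digest), TDigest)` would be the intended round trip. Storing the dictionary form rather than a pickle keeps the stored data independent of the object's internal layout.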
Install
-
pip install tdigest
Imports
- TDigest
from tdigest import TDigest
Quickstart
import numpy as np
from tdigest import TDigest
# Create a TDigest instance
digest = TDigest()
# Update the digest sequentially with random data
for _ in range(5000):
    digest.update(np.random.random())
# Or update the digest in batches
another_digest = TDigest()
another_digest.batch_update(np.random.random(5000))
# Compute the 15th percentile
print(f"15th percentile (sequential): {digest.percentile(15)}")
print(f"15th percentile (batch): {another_digest.percentile(15)}")
# Sum two digests
sum_digest = digest + another_digest
print(f"30th percentile (summed): {sum_digest.percentile(30)}")