RDT (Reversible Data Transforms)
RDT (Reversible Data Transforms) is a Python library that transforms raw data into fully numerical data, ready for data science tasks. Every transformation is reversible, so transformed data can be converted back to its original format. RDT is part of The Synthetic Data Vault Project and is actively maintained by DataCebo, with frequent releases. The current version is 1.21.0.
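To make the idea of a reversible transform concrete, here is a toy sketch in plain Python. It is not RDT's implementation or API: `ToyLabelEncoder` is a hypothetical class invented for illustration, and it only shows the fit, transform, reverse-transform round trip on a single categorical column.

```python
# Toy illustration only: this is NOT RDT's implementation or API.
class ToyLabelEncoder:
    """Map categorical values to integer codes and back (hypothetical helper)."""

    def fit(self, values):
        # Assign each distinct value an integer code in order of first appearance.
        self._to_code = {}
        for value in values:
            self._to_code.setdefault(value, len(self._to_code))
        # Keep the inverse mapping so the transform can be undone.
        self._to_value = {code: value for value, code in self._to_code.items()}
        return self

    def transform(self, values):
        # Raw values -> numerical codes.
        return [self._to_code[value] for value in values]

    def reverse_transform(self, codes):
        # Numerical codes -> original values.
        return [self._to_value[code] for code in codes]


raw = ["gold", "silver", "gold", "bronze"]
encoder = ToyLabelEncoder().fit(raw)
numeric = encoder.transform(raw)
restored = encoder.reverse_transform(numeric)

print(numeric)          # [0, 1, 0, 2]
print(restored == raw)  # True
```

The key design point is that fitting stores enough state (here, the inverse mapping) to recover the original values exactly.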
Warnings
- breaking RDT versions prior to 0.2.0 had a significantly different API. Version 0.2.0 introduced a brand new API, removed the old metadata JSON from user arguments, and made transformers work exclusively with pandas Series.
- breaking Version 0.6.0 brought major changes to the `HyperTransformer` and `BaseTransformer` APIs, enabling multi-column input for transformers and allowing sequences of transformers per column.
- deprecated The `FrequencyEncoder` transformer is deprecated and will not be supported in future RDT versions.
- deprecated The distribution option names for `GaussianNormalizer` have been updated to be consistent with `scipy`. `gaussian` is now `norm`, `student_t` is `t`, and `truncated_gaussian` is `truncnorm`.
- gotcha The 'text' `sdtype` was removed in RDT 1.13.0. Using 'text' as an sdtype in that version or any later one raises an error.
- gotcha Python 3.6 support was dropped in RDT 1.0.0, and later versions have stricter Python requirements (e.g., currently >=3.9, <3.15).
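When migrating older code, two of the warnings above can be checked mechanically. The sketch below is a hedged illustration: `RENAMED_DISTRIBUTIONS`, `modernize_distribution`, and `python_supported` are hypothetical helper names, not part of RDT. The rename table comes from the `GaussianNormalizer` note, and the version bounds from the example range `>=3.9, <3.15`, which may differ for the RDT release you install.

```python
import sys

# Old-to-new distribution names, taken from the GaussianNormalizer note above.
RENAMED_DISTRIBUTIONS = {
    "gaussian": "norm",
    "student_t": "t",
    "truncated_gaussian": "truncnorm",
}


def modernize_distribution(name):
    """Return the scipy-style name for a legacy name; other names pass through."""
    return RENAMED_DISTRIBUTIONS.get(name, name)


def python_supported(version_info, minimum=(3, 9), upper=(3, 15)):
    """Check a (major, minor, ...) version against an assumed >=3.9, <3.15 range."""
    return minimum <= tuple(version_info[:2]) < upper


print(modernize_distribution("student_t"))  # t
print(python_supported(sys.version_info))
```

Names that are already current, such as `norm`, are returned unchanged, so the helper is safe to apply to every configuration value.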
Install
pip install rdt
Imports
- HyperTransformer
from rdt import HyperTransformer
- get_demo
from rdt import get_demo
Quickstart
import pandas as pd
from rdt import HyperTransformer, get_demo
# Load a demo dataset
customers = get_demo()
print("Original Data:\n", customers.head())
# Initialize and detect config with HyperTransformer
ht = HyperTransformer()
ht.detect_initial_config(data=customers)
print("\nDetected Config:\n", ht.get_config())
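# Fit the transformers to the data, then transform it
# (transform() raises an error if the HyperTransformer has not been fitted)
ht.fit(customers)
transformed_data = ht.transform(customers)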
print("\nTransformed Data (first 5 rows):\n", transformed_data.head())
# Reverse transform the data back to original format
reversed_data = ht.reverse_transform(transformed_data)
print("\nReversed Data (first 5 rows):\n", reversed_data.head())