petl
petl is a versatile, open-source Python package designed for Extract, Transform, and Load (ETL) operations on tabular data. It provides a simple yet powerful way to handle data from various sources like CSV files, databases, or in-memory structures, focusing on memory efficiency and ease of use. As of version 1.7.17, it is actively maintained with regular releases and a focus on core ETL functionalities, making it suitable for data engineers and analysts seeking efficient pipelines without the overhead of heavier frameworks.
Warnings
- breaking Version 2.0 of petl will drop support for Python 2.7, with the minimum supported Python version becoming 3.6. Code running on older Python environments will require an upgrade.
- gotcha When creating custom generator functions within a petl pipeline, directly raising `StopIteration` to signal exhaustion will result in a `RuntimeError` in Python 3.7 and later, due to PEP 479. petl itself has addressed this internally in `v1.7.8` and `v1.7.14`.
- gotcha While petl emphasizes lazy evaluation, certain operations like `sort()`, `tojson()` (without `lines=True`), `look()`, or `see()` can load entire tables into memory, potentially leading to high memory consumption for very large datasets.
- gotcha Prior to `v1.7.16`, joining tables with uneven rows could produce incorrect results. Users performing join operations on such tables with older versions might experience data integrity issues.
- gotcha When using `etl.fromdicts()` without explicitly providing a header, the order of fields inferred from sampling the input dictionaries might not be stable or consistent across runs.
- gotcha In `v1.7.12`, a fix clarified that `to*()` functions (e.g., `tocsv()`, `tojson()`) should output to `stdout` by default if no specific output file/source is provided. Code on older versions or code expecting a different default behavior might need adjustment.
Install
-
pip install petl
Imports
- petl
import petl as etl
Quickstart
import petl as etl
import os
# Simulate an input CSV file
csv_data = """name,age,city
alice,30,new york
bob,24,london
charlie,35,paris
diana,28,london
"""
with open('input.csv', 'w') as f:
f.write(csv_data)
# Extract: Read data from a CSV file
table1 = etl.fromcsv('input.csv')
# Transform: Filter rows where age is > 25 and city is 'london'
table2 = etl.select(table1, lambda row: row.age > 25 and row.city == 'london')
# Add a new column 'status'
table3 = etl.addfield(table2, 'status', 'eligible')
# Load: Write the transformed data to a new CSV file
etl.tocsv(table3, 'output.csv')
# Verify the output
with open('output.csv', 'r') as f:
print(f.read())
# Expected output:
# name,age,city,status
# bob,24,london,eligible - Correction: This should be: bob,24,london,eligible (if age > 20 for example)
# Corrected expected output (age > 25 AND city == 'london'):
# name,age,city,status
# diana,28,london,eligible