Pandarallel
Pandarallel is a Python library that extends Pandas to support parallel processing across multiple CPU cores. It aims to significantly speed up Pandas operations on large datasets by distributing computations, often requiring only a one-line code change. The library also provides progress bars. It is currently at version 1.6.5 and is actively maintained.
Warnings
- gotcha Pandarallel can require up to twice the memory of standard Pandas operations. Ensure your system has sufficient RAM, especially for large datasets.
- gotcha On Windows, functions passed to `pandarallel` must be self-contained and should not depend on external resources (e.g., global variables, complex closures) due to Python's `multiprocessing` 'spawn' start method.
- gotcha Parallelization introduces overhead. For small datasets or very fast operations, `pandarallel` might not provide a speedup, or could even be slower than native Pandas.
- gotcha Pandarallel scales best with the number of *physical* CPU cores, not necessarily logical cores (hyperthreading). Setting `nb_workers` higher than physical cores may not yield further performance gains.
- gotcha The `shm_size_mb` parameter in `pandarallel.initialize()` is deprecated and should no longer be used.
- gotcha Pandarallel can sometimes get stuck without raising errors if all physical cores are heavily utilized by other background processes. The progress bar might stop updating.
- gotcha Functions defined locally (e.g., inside another function) or using closures may lead to `AttributeError: Can't pickle local object` errors, a common issue with Python's multiprocessing.
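The pickling warning above can be reproduced with the standard library alone. The sketch below is not pandarallel code: it only checks whether a function survives `pickle.dumps`, which is what Python's `multiprocessing` relies on by default. The names `is_picklable`, `square`, and `make_local_square` are hypothetical helpers introduced for illustration.

```python
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError, TypeError):
        return False


def square(x):
    # Module-level function: pickled by reference to its importable name.
    return x * x


def make_local_square():
    def local_square(x):
        # Nested function: it has no importable name, so pickling fails.
        return x * x
    return local_square


if __name__ == "__main__":
    print("module-level function picklable:", is_picklable(square))
    print("nested function picklable:", is_picklable(make_local_square()))
```

If a function is reported as not picklable, moving it to module level and passing any extra state as explicit arguments is the usual fix.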
Install
pip install pandarallel
Imports
- pandarallel
from pandarallel import pandarallel
Quickstart
import pandas as pd
from pandarallel import pandarallel
import os
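# Initialize pandarallel with an explicit worker count.
# os.cpu_count() returns logical cores; a lower value matching the physical
# core count may perform just as well (see Warnings).
# progress_bar=True displays a progress bar while the computation runs.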
pandarallel.initialize(nb_workers=os.cpu_count(), progress_bar=True)
# Create a sample DataFrame with some data
data = {'col1': range(1_000_000), 'col2': [f'item_{i}' for i in range(1_000_000)]}
df = pd.DataFrame(data)
# Define a CPU-bound function to apply
def example_computation(x):
    # Simulate a computationally intensive task
    res = 0
    for i in range(50):
        res += (x * i) ** 0.5
    return res
# Apply the function in parallel using pandarallel's parallel_apply
# This replaces df['col1'].apply(example_computation)
print("Starting parallel computation...")
df['result'] = df['col1'].parallel_apply(example_computation)
print("Computation complete. First 5 rows of the DataFrame with results:")
print(df.head())
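Because parallelization adds overhead, it can be worth timing both variants on your own data before committing to `parallel_apply`. The following is a minimal sketch using only `time.perf_counter`. The helpers `best_time` and `compare_apply` are hypothetical names, not part of pandarallel, and `compare_apply` assumes it receives a pandas Series after `pandarallel.initialize()` has been called, so that `parallel_apply` is available.

```python
import time


def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time (seconds) over several runs of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def compare_apply(series, func, repeats=3):
    """Time series.apply(func) against series.parallel_apply(func).

    Assumption: `series` exposes both methods, as a pandas Series does
    once pandarallel.initialize() has run.
    """
    serial = best_time(series.apply, func, repeats=repeats)
    parallel = best_time(series.parallel_apply, func, repeats=repeats)
    return serial, parallel


# Example usage with the Quickstart objects (not executed here):
# serial, parallel = compare_apply(df['col1'], example_computation)
# print(f"apply: {serial:.2f}s, parallel_apply: {parallel:.2f}s")
```

If the parallel time is not clearly lower, the dataset or the per-row work is probably too small to amortize the cost of starting workers and transferring data.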