Pandarallel

1.6.5 · active · verified Tue Apr 14

Pandarallel is a Python library that parallelizes pandas operations across multiple CPU cores. It aims to speed up operations on large datasets by distributing the work across worker processes, often with only a one-line code change (for example, replacing `apply` with `parallel_apply`). It can also display progress bars. The current version is 1.6.5, and the library is actively maintained.

Warnings

Install

Imports

Quickstart

This quickstart initializes pandarallel and then calls `parallel_apply` on a pandas Series. The applied function is a simple CPU-bound loop, so the work is of the kind that benefits from running on several cores. `nb_workers` is set explicitly to the CPU count for clarity, and a progress bar is enabled.

import pandas as pd
from pandarallel import pandarallel
import os

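# Initialize pandarallel. nb_workers is set explicitly here; if omitted,
# pandarallel picks a default based on the machine's core count.
# os.cpu_count() can return None, so fall back to a single worker.
# progress_bar=True displays progress while the computation runs.
pandarallel.initialize(nb_workers=os.cpu_count() or 1, progress_bar=True)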

# Create a sample DataFrame with some data
data = {'col1': range(1_000_000), 'col2': [f'item_{i}' for i in range(1_000_000)]}
df = pd.DataFrame(data)

# Define a CPU-bound function to apply
def example_computation(x):
    # Simulate a computationally intensive task
    res = 0
    for i in range(50):
        res += (x * i) ** 0.5
    return res

# Apply the function in parallel using pandarallel's parallel_apply
# This replaces df['col1'].apply(example_computation)
print("Starting parallel computation...")
df['result'] = df['col1'].parallel_apply(example_computation)

print("Computation complete. First 5 rows of the DataFrame with results:")
print(df.head())
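
To make the idea of distributing a computation across workers more concrete, here is a minimal, dependency-free sketch of the split / apply / concatenate pattern. It is an illustration only and not pandarallel's actual implementation: `split_into_chunks` is a hypothetical helper, and the chunks are processed one after another rather than in separate worker processes. The point is that applying a function chunk by chunk and concatenating in chunk order gives the same result as a single sequential pass.

```python
def example_computation(x):
    # Same CPU-bound toy function as in the quickstart.
    res = 0.0
    for i in range(50):
        res += (x * i) ** 0.5
    return res


def split_into_chunks(n_rows, nb_workers):
    """Hypothetical helper: return (start, stop) pairs that cover
    range(n_rows) in at most nb_workers contiguous chunks."""
    if nb_workers < 1:
        raise ValueError("nb_workers must be at least 1")
    base, extra = divmod(n_rows, nb_workers)
    bounds = []
    start = 0
    for worker in range(nb_workers):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            continue  # more workers than rows: skip empty chunks
        bounds.append((start, start + size))
        start += size
    return bounds


values = list(range(1_000))
chunks = split_into_chunks(len(values), nb_workers=4)

# In a parallel run each chunk would be handed to its own worker process;
# here the chunks are processed sequentially to keep the sketch simple.
partial_results = [
    [example_computation(x) for x in values[start:stop]]
    for start, stop in chunks
]

# Concatenate in chunk order so results line up with the input order.
combined = [r for part in partial_results for r in part]

# The chunked result matches a plain sequential pass.
assert combined == [example_computation(x) for x in values]
```

Because each row is handled independently, the speedup from real parallel execution depends on the per-row cost outweighing the overhead of moving chunks between processes. It is worth timing against plain `apply` on your own data.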
