Modin

0.37.1 · active · verified Sat Apr 11

Modin is an open-source Python library designed to accelerate pandas workflows by running them on distributed execution engines such as Ray, Dask, or Unidist. It aims to be a drop-in replacement for pandas: in most cases the only change required is the import statement. Modin supports datasets ranging from megabytes to terabytes, with the largest speedups on larger data, and is actively maintained with frequent releases.

Warnings

Install

Imports

Quickstart

This quickstart shows how to use Modin by changing only the pandas import statement. Modin automatically parallelizes DataFrame operations across the available cores or nodes using the installed backend (Ray, Dask, or Unidist). On small datasets the difference from pandas may be negligible, or Modin may even be slightly slower because of overhead; the significant speedups appear on large data.

import modin.pandas as pd

# Create a Modin DataFrame (this uses your configured backend, e.g., Ray or Dask)
data = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data)

print("Original Modin DataFrame:")
print(df)

# Perform a common operation, like adding a new column
df['col3'] = df['col1'] + df['col2']

print("DataFrame after operation:")
print(df)

# Note: Modin's performance benefits become more apparent on larger datasets.
# Example with a larger dataset (requires a large CSV file of your own, such as NYC taxi trip data).
# To try it, uncomment the block below and replace 'path/to/your/data.csv' with a real path.
# try:
#     large_df = pd.read_csv('path/to/your/data.csv')
#     print(f"Loaded large DataFrame with {len(large_df)} rows.")
#     print(large_df.head())
# except FileNotFoundError:
#     print("Skipping large dataset example: 'path/to/your/data.csv' not found.")
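The quickstart relies on whichever backend is installed. As a minimal sketch of making that choice explicit, the snippet below checks which backend packages are importable and sets the `MODIN_ENGINE` environment variable accordingly. The helper name `choose_engine`, the `CANDIDATES` table, and the preference order (Ray first, then Dask) are assumptions for illustration, not part of Modin. Unidist is left out because it needs additional backend configuration not covered here.

```python
import importlib.util
import os

# Hypothetical preference order: (MODIN_ENGINE value, module that must be importable).
# Modin on Dask depends on the `distributed` package, so that is what is probed.
CANDIDATES = (("ray", "ray"), ("dask", "distributed"))


def _installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def choose_engine(is_installed=_installed):
    """Return the first engine whose backend module is available, else None."""
    for engine, module_name in CANDIDATES:
        if is_installed(module_name):
            return engine
    return None


engine = choose_engine()
if engine is not None:
    # setdefault keeps any value the user already exported in the shell.
    os.environ.setdefault("MODIN_ENGINE", engine)
print("Selected engine:", engine)
```

This would need to run before `import modin.pandas as pd`, since Modin reads its configuration when it starts up. If no backend is found, the snippet leaves the environment untouched and Modin falls back to its own default behavior.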
