{"id":6022,"library":"pandarallel","title":"Pandarallel","description":"Pandarallel is a Python library that extends Pandas to support parallel processing across multiple CPU cores. It aims to speed up Pandas operations on large datasets by distributing the work across cores, often with only a one-line code change (for example, replacing `apply` with `parallel_apply`). The library also provides progress bars. It is currently at version 1.6.5 and is actively maintained.","status":"active","version":"1.6.5","language":"en","source_language":"en","source_url":"https://github.com/nalepae/pandarallel","tags":["pandas","parallel-processing","performance","dataframe","multiprocessing"],"install":[{"cmd":"pip install pandarallel","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Pandarallel is built on top of Pandas and parallelizes its operations.","package":"pandas","optional":false},{"reason":"Frequently used in conjunction with Pandas for numerical operations.","package":"numpy","optional":false}],"imports":[{"symbol":"pandarallel","correct":"from pandarallel import pandarallel"}],"quickstart":{"code":"import pandas as pd\nfrom pandarallel import pandarallel\nimport os\n\n# Initialize pandarallel. nb_workers is set explicitly here;\n# note that os.cpu_count() returns the number of logical cores.\n# progress_bar=True displays a progress bar while the work runs.\npandarallel.initialize(nb_workers=os.cpu_count(), progress_bar=True)\n\n# Create a sample DataFrame with some data\ndata = {'col1': range(1_000_000), 'col2': [f'item_{i}' for i in range(1_000_000)]}\ndf = pd.DataFrame(data)\n\n# Define a CPU-bound function to apply\ndef example_computation(x):\n    # Simulate a computationally intensive task\n    res = 0\n    for i in range(50):\n        res += (x * i) ** 0.5\n    return res\n\n# Apply the function in parallel using pandarallel's parallel_apply\n# This replaces df['col1'].apply(example_computation)\nprint(\"Starting parallel computation...\")\ndf['result'] = df['col1'].parallel_apply(example_computation)\n\nprint(\"Computation complete. 
First 5 rows of the DataFrame with results:\")\nprint(df.head())","lang":"python","description":"This quickstart initializes pandarallel and then calls `parallel_apply` on a Pandas Series. It applies a simple CPU-bound function to showcase the effect of parallelization. `nb_workers` is set explicitly to `os.cpu_count()` (the logical core count) for clarity, and a progress bar is enabled."},"warnings":[{"fix":"Monitor memory usage. For data larger than available memory, consider alternatives like Dask or PySpark.","message":"Pandarallel can require up to twice the memory of standard Pandas operations. Ensure your system has sufficient RAM, especially for large datasets.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Define functions at the top level of the module. For complex scenarios, consider using Windows Subsystem for Linux (WSL) or refactoring functions to be entirely self-contained.","message":"On Windows, functions passed to `pandarallel` must be self-contained and should not depend on external resources (e.g., global variables, complex closures), because Python's `multiprocessing` uses the 'spawn' start method there.","severity":"gotcha","affected_versions":"All versions on Windows"},{"fix":"Benchmark performance with and without `pandarallel` for your specific use case to determine if parallelization is beneficial.","message":"Parallelization introduces overhead. For small datasets or very fast operations, `pandarallel` might not provide a speedup, or could even be slower than native Pandas.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For optimal performance, set `nb_workers` to your system's number of physical CPU cores or allow `pandarallel` to determine it automatically.","message":"Pandarallel scales best with the number of *physical* CPU cores, not necessarily logical cores (hyperthreading). 
Setting `nb_workers` higher than physical cores may not yield further performance gains.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Remove `shm_size_mb` from `pandarallel.initialize()` calls. Memory file system usage is now controlled by `use_memory_fs`.","message":"The `shm_size_mb` parameter in `pandarallel.initialize()` is deprecated and should no longer be used.","severity":"gotcha","affected_versions":">=1.x.x"},{"fix":"Monitor system CPU usage. Try reducing `nb_workers` to leave some cores free, or ensure your environment has sufficient idle CPU resources.","message":"Pandarallel can sometimes get stuck without raising errors if all physical cores are heavily utilized by other background processes. The progress bar might stop updating.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure functions passed to `pandarallel` methods are defined at the top level of a module, not nested within other functions.","message":"Functions defined locally (e.g., inside another function) or using closures may lead to `AttributeError: Can't pickle local object` errors, a common issue with Python's multiprocessing.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}