PySpark-Pandas
PySpark-Pandas (version 0.0.7) is an early project that aimed to provide tools and algorithms for pandas DataFrames distributed on PySpark. Its last release was in 2016, and the project has since been abandoned. The PyPI description itself advises users to consider alternatives such as SparklingPandas, and the official Apache Spark project now ships its own 'Pandas API on Spark' (formerly Koalas), which is the recommended modern solution.
Warnings
- breaking The `pyspark-pandas` (0.0.7) library is effectively abandoned since its last commit in 2016. It is highly unlikely to be compatible with modern versions of PySpark or Python, and its functionality has been superseded by the official 'Pandas API on Spark' (formerly Koalas) integrated directly into PySpark as `pyspark.pandas`.
- deprecated The PyPI description for `pyspark-pandas` explicitly advises users to 'Please consider the SparklingPandas project before this one'. This indicates the project was considered superseded even at the time of its last update.
- gotcha Confusing `pyspark-pandas` (the abandoned PyPI package) with `pyspark.pandas` (the official Pandas API on Spark) is a common mistake. They are distinct projects with different import paths and maintenance statuses.
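Since the two names differ only by a dot versus an underscore, a small runtime check can tell them apart without triggering an ImportError. This is a minimal sketch using only the standard library; the helper name `which_pandas_on_spark` is hypothetical, not part of either package.

```python
import importlib.util

def which_pandas_on_spark():
    """Report which of the two similarly named packages is importable.

    'pyspark.pandas' (official) ships inside the 'pyspark' distribution,
    while 'pyspark_pandas' (the abandoned PyPI package) is a separate
    top-level module with an underscore in its import path.
    """
    # Check the parent 'pyspark' first: find_spec on a submodule
    # raises ModuleNotFoundError if the parent is absent.
    official = (importlib.util.find_spec("pyspark") is not None
                and importlib.util.find_spec("pyspark.pandas") is not None)
    legacy = importlib.util.find_spec("pyspark_pandas") is not None
    return {"pyspark.pandas": official, "pyspark_pandas": legacy}

print(which_pandas_on_spark())
```

On a modern environment with PySpark >= 3.2 installed, the official entry should be True and the legacy one False.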
Install
pip install pyspark-pandas
Imports
- DataFrame
# Abandoned PyPI package (underscore import path):
from pyspark_pandas import DataFrame
# Official Pandas API on Spark, bundled with modern PySpark (recommended):
import pyspark.pandas as ps
Quickstart
# The 'pyspark-pandas' (0.0.7) library is abandoned and lacks a functional, self-contained quickstart example
# compatible with modern Spark/Python environments.
# Its primary functionality would have involved wrapping Spark RDDs or DataFrames with a pandas-like interface.
#
# For modern 'Pandas API on Spark' functionality, use pyspark.pandas:
from pyspark.sql import SparkSession
import pyspark.pandas as ps
import pandas as pd
# Create a SparkSession
spark = SparkSession.builder.appName("PandasOnSparkQuickstart").getOrCreate()
# Create a pandas-on-Spark DataFrame from a pandas DataFrame
pd_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
ps_df = ps.from_pandas(pd_df)
print("Pandas-on-Spark DataFrame:")
print(ps_df)
print(f"Type: {type(ps_df)}")
# Perform a simple operation
ps_df['col3'] = ps_df['col1'] + ps_df['col2']
print("\nDataFrame after operation:")
print(ps_df)
# Convert back to a pandas DataFrame (collects data to driver)
pandas_result = ps_df.to_pandas()
print("\nResult as pandas DataFrame:")
print(pandas_result)
spark.stop()