PySpark-Pandas

0.0.7 · abandoned · verified Wed Apr 15

PySpark-Pandas (version 0.0.7) is an early project that aimed to provide tools and algorithms for pandas DataFrames distributed on PySpark. Its last release was in 2016, and the project has since been abandoned. The PyPI description itself advises users to consider alternatives such as SparklingPandas, and the official Apache Spark project now ships its own 'Pandas API on Spark' (formerly Koalas), which is the recommended modern solution for the same functionality.

Warnings

Install
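
The package was published on PyPI, so a standard pip install should work, assuming the distribution name matches the project title (note that installing an abandoned 2016-era package into a modern environment may fail or conflict with current PySpark):

# Install the abandoned package (not recommended; shown for completeness)
pip install pyspark-pandas==0.0.7

# Recommended instead: the Pandas API on Spark ships with PySpark itself
pip install pyspark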

Imports

Quickstart

The `pyspark-pandas` library (version 0.0.7) is unmaintained and does not offer a functional quickstart in modern environments. The code below instead demonstrates the official 'Pandas API on Spark' (`pyspark.pandas`), the recommended alternative for distributed pandas-like operations in a current PySpark environment.

# The 'pyspark-pandas' (0.0.7) library is abandoned and lacks a functional, self-contained quickstart example
# compatible with modern Spark/Python environments.
# Its primary functionality would have involved wrapping Spark RDDs or DataFrames with a pandas-like interface.
#
# For modern 'Pandas API on Spark' functionality, use pyspark.pandas:
from pyspark.sql import SparkSession
import pyspark.pandas as ps
import pandas as pd

# Create a SparkSession
spark = SparkSession.builder.appName("PandasOnSparkQuickstart").getOrCreate()

# Create a pandas-on-Spark DataFrame from a pandas DataFrame
pd_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
ps_df = ps.from_pandas(pd_df)

print("Pandas-on-Spark DataFrame:")
print(ps_df)
print(f"Type: {type(ps_df)}")

# Perform a simple operation
ps_df['col3'] = ps_df['col1'] + ps_df['col2']
print("\nDataFrame after operation:")
print(ps_df)

# Convert back to a pandas DataFrame (collects data to driver)
pandas_result = ps_df.to_pandas()
print("\nResult as pandas DataFrame:")
print(pandas_result)

spark.stop()
