{"id":6819,"library":"pyspark-pandas","title":"PySpark-Pandas","description":"PySpark-Pandas (version 0.0.7) is an early project that aimed to provide tools and algorithms for pandas DataFrames distributed on PySpark. Its last release was in 2016, and the project has since been abandoned. The PyPI description itself advises users to consider alternatives like SparklingPandas, and the official Apache Spark project now includes its own 'Pandas API on Spark' (formerly Koalas) for similar functionality, which is the recommended modern solution.","status":"abandoned","version":"0.0.7","language":"en","source_language":"en","source_url":"https://github.com/adgaudio/pyspark_pandas","tags":["pyspark","pandas","distributed","dataframe","legacy"],"install":[{"cmd":"pip install pyspark-pandas","lang":"bash","label":"Install PySpark-Pandas"}],"dependencies":[{"reason":"Core dependency for distributed DataFrame processing.","package":"pyspark","optional":false},{"reason":"Core dependency for pandas DataFrame compatibility.","package":"pandas","optional":false}],"imports":[{"note":"The `pyspark.pandas` module is the official 'Pandas API on Spark' and is a different, actively maintained project within Apache Spark. The 'pyspark-pandas' library (version 0.0.7) is a separate, abandoned project.","wrong":"import pyspark.pandas as ps","symbol":"DataFrame","correct":"from pyspark_pandas import DataFrame"}],"quickstart":{"code":"# The 'pyspark-pandas' (0.0.7) library is abandoned and lacks a functional, self-contained quickstart example\n# compatible with modern Spark/Python environments.\n# Its primary functionality would have involved wrapping Spark RDDs or DataFrames with a pandas-like interface.\n#\n# For modern 'Pandas API on Spark' functionality, use pyspark.pandas:\nfrom pyspark.sql import SparkSession\nimport pyspark.pandas as ps\nimport pandas as pd\n\n# Create a SparkSession\nspark = SparkSession.builder.appName(\"PandasOnSparkQuickstart\").getOrCreate()\n\n# Create a pandas-on-Spark DataFrame from a pandas DataFrame\npd_df = pd.DataFrame({\"col1\": [1, 2, 3], \"col2\": [4, 5, 6]})\nps_df = ps.from_pandas(pd_df)\n\nprint(\"Pandas-on-Spark DataFrame:\")\nprint(ps_df)\nprint(f\"Type: {type(ps_df)}\")\n\n# Perform a simple operation\nps_df['col3'] = ps_df['col1'] + ps_df['col2']\nprint(\"\\nDataFrame after operation:\")\nprint(ps_df)\n\n# Convert back to a pandas DataFrame (collects data to driver)\npandas_result = ps_df.to_pandas()\nprint(\"\\nResult as pandas DataFrame:\")\nprint(pandas_result)\n\nspark.stop()","lang":"python","description":"The `pyspark-pandas` library (version 0.0.7) is largely unmaintained and does not offer a readily available, functional quickstart. The provided code demonstrates a quickstart using the official 'Pandas API on Spark' (`pyspark.pandas`), which is the recommended alternative for distributed pandas-like operations in a modern PySpark environment."},"warnings":[{"fix":"Do not use `pyspark-pandas`. Instead, use `pyspark.pandas` which is included with PySpark (PySpark 3.2+). Install `pyspark` and import `pyspark.pandas as ps`.","message":"The `pyspark-pandas` (0.0.7) library is effectively abandoned since its last commit in 2016. It is highly unlikely to be compatible with modern versions of PySpark or Python, and its functionality has been superseded by the official 'Pandas API on Spark' (formerly Koalas) integrated directly into PySpark as `pyspark.pandas`.","severity":"breaking","affected_versions":"<=0.0.7"},{"fix":"For distributed pandas-like functionality, use `pyspark.pandas` (the official Pandas API on Spark) which is actively maintained and integrated into Apache Spark.","message":"The PyPI description for `pyspark-pandas` explicitly advises users to 'Please consider the SparklingPandas project before this one'. This indicates the project was considered superseded even at the time of its last update.","severity":"deprecated","affected_versions":"All (0.0.7)"},{"fix":"Always import `pyspark.pandas as ps` for the official Pandas API on Spark. The `pyspark-pandas` package should not be used.","message":"Confusing `pyspark-pandas` (the abandoned PyPI package) with `pyspark.pandas` (the official Pandas API on Spark) is a common mistake. They are distinct projects with different import paths and maintenance statuses.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}