Koalas: pandas API on Apache Spark

1.8.2 · deprecated · verified Sat Apr 11

Koalas provides a pandas-compatible API that runs on Apache Spark, so users familiar with pandas can work with large, distributed datasets. The current version is 1.8.2. Development of the standalone library has ceased because its functionality was integrated into PySpark as 'pandas API on Spark' (the pyspark.pandas module) starting with Apache Spark 3.2. Maintenance releases are infrequent and mainly address critical bugs.
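Because the same API now ships inside PySpark, code that must run on both older and newer Spark installations can pick whichever module is available. The following is a minimal sketch, not an official migration path: load_pandas_on_spark is a hypothetical helper name introduced here for illustration, and it assumes that an ImportError is the only failure mode worth handling.

```python
import importlib


def load_pandas_on_spark():
    """Return the first importable pandas-on-Spark module, or None.

    Hypothetical helper for illustration; not part of Koalas or PySpark.
    """
    # Prefer the module bundled with PySpark 3.2+, then fall back to
    # the standalone Koalas package for older Spark installations.
    for name in ("pyspark.pandas", "databricks.koalas"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


ps = load_pandas_on_spark()
if ps is None:
    print("Neither pyspark.pandas nor databricks.koalas is installed")
else:
    print("Using", ps.__name__)
```

Some names differ between the two modules, so code written against Koalas may still need small adjustments after switching the import.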

Quickstart

This quickstart creates a Koalas DataFrame from a pandas DataFrame, computes the mean of one column, and converts the result back to a pandas DataFrame. It needs a working PySpark installation, because Koalas runs every operation through a Spark session.

import databricks.koalas as ks
import pandas as pd

# Create a pandas DataFrame
pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Convert to a Koalas DataFrame
kdf = ks.DataFrame(pdf)

print("Koalas DataFrame head:")
print(kdf.head())

print("Mean of column 'A':", kdf['A'].mean())

# Convert back to a pandas DataFrame
pdf_result = kdf.to_pandas()
print("\nConverted back to pandas DataFrame:")
print(pdf_result)
