Koalas: pandas API on Apache Spark
Koalas provides a pandas-compatible API that runs on Apache Spark, so users familiar with pandas can work with large, distributed datasets. The latest release is 1.8.2. Development as a standalone library has ceased because its functionality was officially integrated into PySpark as 'pandas API on Spark' starting with Apache Spark 3.2. Maintenance releases are infrequent and primarily address critical bugs.
Warnings
- breaking Koalas as a standalone library is deprecated. All its functionality has been officially integrated into PySpark as 'pandas API on Spark' starting with Apache Spark 3.2. Users are strongly advised to migrate to PySpark directly, typically by replacing `import databricks.koalas as ks` with `import pyspark.pandas as ps`.
- breaking The default plotting backend for Koalas switched from Matplotlib to Plotly in version 1.7.0. This can change the visual output and require different plotting options.
- gotcha Koalas historically had different behavior than pandas regarding unnamed Series. Prior to v1.2.0, Koalas would automatically name a Series '0' if no name was specified, unlike pandas which allows a truly unnamed Series. This was fixed in v1.2.0 to align with pandas.
- gotcha Compatibility with specific pandas versions can introduce subtle bugs. For example, Koalas 1.8.2 addressed an issue with `_builtin_table` import in `groupby.apply` that affected pandas versions 1.3.0 and above.
- gotcha Early versions of Koalas (pre-1.5.0) had limited or inconsistent support for complex Index operations (e.g., chained arithmetic operations), sometimes raising `AssertionError`.
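The deprecation and plotting-backend warnings above can be handled in one place while a codebase is mid-migration. The following is a minimal sketch, not an official migration tool: `pick_backend` and `load_backend` are hypothetical helper names, and the `plotting.backend` option key is assumed to be available in the installed version (it is documented for pandas API on Spark and is believed to exist in recent Koalas releases).

```python
import importlib
import importlib.util


def pick_backend():
    """Return the first importable module name, preferring PySpark's built-in API."""
    for name in ("pyspark.pandas", "databricks.koalas"):
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # The parent package (pyspark or databricks) is not installed.
            continue
    return None


def load_backend(prefer_matplotlib=False):
    """Import the chosen module; optionally switch plotting back to Matplotlib."""
    name = pick_backend()
    if name is None:
        return None
    module = importlib.import_module(name)
    if prefer_matplotlib:
        # Assumption: the installed version exposes the 'plotting.backend' option.
        module.set_option("plotting.backend", "matplotlib")
    return module
```

Calling `ps = load_backend(prefer_matplotlib=True)` would then give code written against the shared API a single import point. Once every environment runs Spark 3.2 or later, the fallback to `databricks.koalas` can be dropped.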
Install
pip install koalas
Imports
- Koalas DataFrame/Series
import databricks.koalas as ks
- Koalas DataFrame/Series (from pandas)
import databricks.koalas as ks
import pandas as pd
Quickstart
import databricks.koalas as ks
import pandas as pd
# Create a pandas DataFrame
pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Convert to a Koalas DataFrame
kdf = ks.DataFrame(pdf)
print("Koalas DataFrame head:")
print(kdf.head())
print("Mean of column 'A':", kdf['A'].mean())
# Convert back to a pandas DataFrame
pdf_result = kdf.to_pandas()
print("\nConverted back to pandas DataFrame:")
print(pdf_result)