PySpark Data Sources

Version 0.1.11, verified Fri May 01.

Custom data sources for reading and writing data in Apache Spark, built on the Python Data Source API. Current version: 0.1.11. Release cadence: irregular, low activity.

pip install pyspark-data-sources
error ImportError: cannot import name 'DataSource' from 'pyspark_data_sources'
cause Using wrong import path or package name.
fix
Import from pyspark_datasources with the correct module name: underscores, and 'datasources' as one word (not 'data_sources'). Note that the abstract DataSource base class itself ships with PySpark, in pyspark.sql.datasource.
error ModuleNotFoundError: No module named 'pyspark'
cause PySpark is not installed.
fix
Install PySpark: pip install pyspark.
error java.lang.UnsupportedClassVersionError: ... Unsupported major.minor version
cause Java version mismatch with Spark installation.
fix
Install Java 8 or 11 and set JAVA_HOME accordingly.
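The UnsupportedClassVersionError message typically ends with a class-file major version number (e.g. "class file version 55.0"). Java class-file majors are offset from the Java release number by 44, so a tiny helper (an illustration, not part of any library) can translate the number in the error into the Java version that produced the class:

```python
def classfile_major_to_java(major: int) -> int:
    """Translate a class-file major version into the Java release number.

    Class-file majors are the Java release plus 44:
    major 52 -> Java 8, 55 -> Java 11, 61 -> Java 17.
    """
    return major - 44
```

If the reported major maps to a newer Java than the one on your PATH, that newer Java is what you need to install and point JAVA_HOME at.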
error PySparkException: [INVALID_HANDLE.STATE] Cannot call methods on a stopped SparkSession.
cause SparkSession was stopped before data source operations.
fix
Ensure SparkSession remains active until all DataFrame operations complete.
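The lifecycle rule above can be sketched as a try/finally scope that stops the session only after all work is done. A minimal hypothetical stub stands in for SparkSession here so the pattern is visible without a running cluster:

```python
class StubSession:
    """Hypothetical stand-in for SparkSession (not the real class)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def run_job(session_factory):
    """Keep the session alive for the whole job; stop it only at the end."""
    spark = session_factory()
    try:
        # ... all DataFrame operations happen here, before stop() ...
        return "done"
    finally:
        spark.stop()  # safe: nothing touches the session after this point
```

Calling spark.stop() (or letting a context manager do it) before a deferred DataFrame action runs is what triggers the INVALID_HANDLE.STATE error.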
gotcha The library is in early development (0.1.x). API may change without notice.
fix Pin to a specific version and test upgrades carefully.
deprecated Python 3.13 is not supported (requires <3.13). Users on 3.13 must downgrade or use alternatives.
fix Use Python 3.9, 3.10, 3.11, or 3.12.
gotcha The PyPI package name is pyspark-data-sources, but the import name is pyspark_datasources: one underscore after 'pyspark', with 'datasources' written as a single word. Guessing pyspark_data_sources causes ImportError.
fix Use `import pyspark_datasources` (note: 'datasources' is one word in the import, with no separator between 'data' and 'sources').
gotcha The library requires Java and a Spark installation. Not a pure Python solution.
fix Ensure Java 8/11 and Spark are installed and SPARK_HOME is set.
breaking The DataSource class API may change; custom data source implementations depend on internal interfaces.
fix Check the source code for the current DataSource abstract methods before implementing.

Create a Spark session, register a data source class, and read through it via format(). The snippet below uses FakeDataSource, one of the sources shipped with the library; the registration step (spark.dataSource.register) is required before format() can resolve the short name.

from pyspark.sql import SparkSession
from pyspark_datasources import FakeDataSource

spark = SparkSession.builder.appName('example').getOrCreate()
spark.dataSource.register(FakeDataSource)  # register before use
df = spark.read.format('fake').load()
df.show()