PySpark Data Sources

Version 0.1.11, verified Fri May 01.

Custom data sources for reading and writing data in Apache Spark, built on the Python Data Source API. Current version: 0.1.11. Release cadence: irregular, low activity.

pip install pyspark-data-sources
error ImportError: cannot import name 'DataSource' from 'pyspark_data_sources'
cause Using wrong import path or package name.
fix
Import from pyspark_datasources with the correct module name: underscores, and 'datasources' as one word (not 'data_sources'). Note that the abstract DataSource base class itself ships with PySpark, in pyspark.sql.datasource.
error ModuleNotFoundError: No module named 'pyspark'
cause PySpark is not installed.
fix
Install PySpark: pip install pyspark.
error java.lang.UnsupportedClassVersionError: ... Unsupported major.minor version
cause Java version mismatch with Spark installation.
fix
Install Java 8 or 11 and set JAVA_HOME accordingly.
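The UnsupportedClassVersionError message typically ends with a class-file major version number (e.g. "class file version 55.0"). Java class-file majors are offset from the Java release number by 44, so a tiny helper (an illustration, not part of any library) can translate the number in the error into the Java version that produced the class:

```python
def classfile_major_to_java(major: int) -> int:
    """Translate a class-file major version into the Java release number.

    Class-file majors are the Java release plus 44:
    major 52 -> Java 8, 55 -> Java 11, 61 -> Java 17.
    """
    return major - 44
```

If the reported major maps to a newer Java than the one on your PATH, that newer Java is what you need to install and point JAVA_HOME at.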
error PySparkException: [INVALID_HANDLE.STATE] Cannot call methods on a stopped SparkSession.
cause SparkSession was stopped before data source operations.
fix
Ensure SparkSession remains active until all DataFrame operations complete.
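The lifecycle rule above can be sketched as a try/finally scope that stops the session only after all work is done. A minimal hypothetical stub stands in for SparkSession here so the pattern is visible without a running cluster:

```python
class StubSession:
    """Hypothetical stand-in for SparkSession (not the real class)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def run_job(session_factory):
    """Keep the session alive for the whole job; stop it only at the end."""
    spark = session_factory()
    try:
        # ... all DataFrame operations happen here, before stop() ...
        return "done"
    finally:
        spark.stop()  # safe: nothing touches the session after this point
```

Calling spark.stop() (or letting a context manager do it) before a deferred DataFrame action runs is what triggers the INVALID_HANDLE.STATE error.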
gotcha The library is in early development (0.1.x). API may change without notice.
fix Pin to a specific version and test upgrades carefully.
deprecated Python 3.13 is not supported (requires <3.13). Users on 3.13 must downgrade or use alternatives.
fix Use Python 3.9, 3.10, 3.11, or 3.12.
gotcha The PyPI package name is pyspark-data-sources, but the import name is pyspark_datasources: one underscore after 'pyspark', with 'datasources' written as a single word. Guessing pyspark_data_sources causes ImportError.
fix Use `import pyspark_datasources` (note: 'datasources' is one word in the import, with no separator between 'data' and 'sources').
gotcha The library requires Java and a Spark installation. Not a pure Python solution.
fix Ensure Java 8/11 and Spark are installed and SPARK_HOME is set.
breaking The DataSource class API may change; custom data source implementations depend on internal interfaces.
fix Check the source code for the current DataSource abstract methods before implementing.

Create a Spark session, register a data source class, and read through it via format(). The snippet below uses FakeDataSource, one of the sources shipped with the library; the registration step (spark.dataSource.register) is required before format() can resolve the short name.

from pyspark.sql import SparkSession
from pyspark_datasources import FakeDataSource

spark = SparkSession.builder.appName('example').getOrCreate()
spark.dataSource.register(FakeDataSource)  # register before use
df = spark.read.format('fake').load()
df.show()