PySpark Data Sources
Version 0.1.11 · verified Fri May 01 · auth: none
Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API. Current version: 0.1.11. Release cadence: irregular, low activity.
Install: pip install pyspark-data-sources

Common errors
error ImportError: cannot import name 'DataSource' from 'pyspark_data_sources'
cause Using the wrong import path or package name.
fix Use from pyspark_datasources import DataSource (note: no hyphens, and 'datasources' rather than 'data_sources').

error ModuleNotFoundError: No module named 'pyspark'
cause PySpark is not installed.
fix Install PySpark: pip install pyspark.

error java.lang.UnsupportedClassVersionError: ... Unsupported major.minor version
cause Java version mismatch with the Spark installation.
fix Install a Java version supported by your Spark release (e.g. 8, 11, or 17) and set JAVA_HOME accordingly.
error PySparkException: [INVALID_HANDLE.STATE] Cannot call methods on a stopped SparkSession.
cause The SparkSession was stopped before data source operations completed.
fix Keep the SparkSession active until all DataFrame operations complete.
Warnings
gotcha The library is in early development (0.1.x). The API may change without notice.
fix Pin to a specific version and test upgrades carefully.
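The pinning advice above can be captured in a requirements file; the exact pin below simply reflects the 0.1.11 release noted on this page:

```text
# requirements.txt -- pin the tested version; bump deliberately after testing
pyspark-data-sources==0.1.11
```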
deprecated Python 3.13 is not supported (requires <3.13). Users on 3.13 must downgrade or use alternatives.
fix Use Python 3.9, 3.10, 3.11, or 3.12.
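A small interpreter guard can surface the version constraint early instead of failing inside Spark; the function name here is illustrative, and the bounds follow the >=3.9, <3.13 range above:

```python
import sys

def supported_python(version_info=sys.version_info):
    """Return True when the interpreter falls inside the package's
    requires-python range (>=3.9, <3.13)."""
    return (3, 9) <= tuple(version_info[:2]) < (3, 13)

# Example usage: fail fast at startup.
# if not supported_python():
#     raise RuntimeError("pyspark-data-sources needs Python 3.9-3.12")
```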
gotcha The package name uses hyphens (pyspark-data-sources) but the import uses an underscore and joins 'data' and 'sources' into one word (pyspark_datasources). This mismatch can cause ImportError.
fix Use import pyspark_datasources (note: 'datasources' is one word in the import).
gotcha The library requires Java and a Spark installation. It is not a pure-Python solution.
fix Ensure Java 8/11 and Spark are installed and SPARK_HOME is set.
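A preflight check can confirm the environment facts this gotcha depends on before starting a session; the function name is made up for illustration:

```python
import os
import shutil

def spark_env_report():
    """Report whether a `java` binary is on PATH, and the current
    JAVA_HOME / SPARK_HOME values (None when unset)."""
    return {
        "java_on_path": shutil.which("java") is not None,
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
    }
```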
breaking The DataSource class API may change; custom data source implementations depend on internal interfaces.
fix Check the source code for the current DataSource abstract methods before implementing.
Imports
- DataSource
  wrong: import pyspark_data_sources
  correct: from pyspark_datasources import DataSource
Quickstart
from pyspark.sql import SparkSession
from pyspark_datasources import FakeDataSource  # a bundled example source
spark = SparkSession.builder.appName('example').getOrCreate()
# Register the data source before using its short name with spark.read.
spark.dataSource.register(FakeDataSource)
df = spark.read.format('fake').load()
df.show()