PySpark Distribution Explorer
PySpark Distribution Explorer (pyspark-dist-explore, current version 0.1.8) is a Python library that enables creating histogram and density plots directly from PySpark DataFrames. It simplifies exploratory data analysis (EDA) for large datasets by leveraging Matplotlib and Pandas to visualize distributions. The project is currently in maintenance mode with infrequent updates.
Common errors
- UserWarning: Matplotlib is currently using agg, which is a non-interactive backend, so figures will not be shown.
  - cause: Matplotlib is configured with a non-interactive backend (such as 'agg') that does not display plots on screen, and `plt.show()` was likely not called.
  - fix: Call `plt.show()` after generating your plot. In an interactive environment (such as Jupyter), use `%matplotlib inline` or `%matplotlib notebook`. Otherwise, save the figure with `plt.savefig('plot.png')`.
- TypeError: cannot convert 'StringType' object to float
  - cause: A PySpark DataFrame column with a non-numeric data type (e.g., StringType) was passed to a plotting function that expects numerical data.
  - fix: Ensure the column being plotted has a numeric type (IntegerType, FloatType, DoubleType). Cast it if necessary: `df.withColumn('numeric_col', df['string_col'].cast('double')).select('numeric_col')`.
- AttributeError: 'DataFrame' object has no attribute 'plot'
  - cause: `.plot()` was called directly on a PySpark DataFrame; PySpark DataFrames do not provide this method.
  - fix: pyspark-dist-explore functions are standalone. Instead of `df.plot()`, create axes with `fig, ax = plt.subplots()` and then call `hist(ax, df.select('column_name'))` or `density_plot(ax, df.select('column_name'))`.
- NameError: name 'spark' is not defined
  - cause: The `SparkSession` object named `spark` was never created or is out of scope at the point of use.
  - fix: Initialize a SparkSession first: `from pyspark.sql import SparkSession; spark = SparkSession.builder.appName("MyApp").getOrCreate()`.
Warnings
- gotcha PySpark-dist-explore functions (`hist`, `density_plot`) require a `matplotlib.axes.Axes` object as their first argument. You must create a Matplotlib figure and axes explicitly before calling these functions.
- gotcha The plotting functions (`hist`, `density_plot`) expect a PySpark DataFrame containing *only* the numerical column(s) you wish to plot. Do not pass the entire DataFrame if it contains non-numerical columns or multiple columns.
- gotcha If plots are not displaying in non-interactive environments (e.g., scripts, remote servers), it might be due to Matplotlib's backend. `plt.show()` is crucial, but an interactive backend might also be needed.
- deprecated The library is in maintenance mode with its last release in 2019 (0.1.8). While functional, it might not receive updates for newer PySpark versions or advanced features, and bug fixes are unlikely.
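When the backend gotcha above bites in a headless script, writing the figure to a file sidesteps `plt.show()` entirely. A matplotlib-only sketch (the bar chart stands in for a pyspark-dist-explore plot, and the output path is an arbitrary choice):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: plt.show() would display nothing
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["a", "b", "c"], [3, 1, 2])  # stand-in for a pyspark-dist-explore plot
out_path = os.path.join(tempfile.gettempdir(), "distribution.png")
fig.savefig(out_path)  # persist the figure instead of showing it
plt.close(fig)
print(out_path)
```

The same `fig.savefig(...)` call works unchanged for figures whose axes were filled by `hist` or `density_plot`.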
Install
- pip install pyspark-dist-explore
Imports
- hist
from pyspark_dist_explore import hist
- density_plot
from pyspark_dist_explore import density_plot
- describe_pd
from pyspark_dist_explore import describe_pd
Quickstart
from pyspark_dist_explore import hist, density_plot, describe_pd
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
# Ensure SparkSession is available (replace with your actual Spark setup)
# For local testing, ensure pyspark is installed: pip install pyspark
spark = SparkSession.builder.appName("DistExploreQuickstart").getOrCreate()
# Create a sample PySpark DataFrame
data = [
(1, "A", 10.5),
(2, "B", 12.0),
(3, "A", 11.2),
(4, "C", 9.8),
(5, "B", 13.1),
(6, "A", 10.8),
(7, "C", 9.5),
(8, "B", 12.5),
(9, "A", 11.0),
(10, "C", 10.0)
]
columns = ["id", "category", "value"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
# 1. Generate a histogram
fig_hist, ax_hist = plt.subplots()
hist(ax_hist, df.select('value'), bins=5, color='skyblue', edgecolor='black')
ax_hist.set_title('Histogram of Value')
ax_hist.set_xlabel('Value')
ax_hist.set_ylabel('Frequency')
plt.tight_layout()
plt.show() # Display the plot
# 2. Generate a density plot
fig_density, ax_density = plt.subplots()
density_plot(ax_density, df.select('value'), color='green', fill=True, alpha=0.5)
ax_density.set_title('Density Plot of Value')
ax_density.set_xlabel('Value')
ax_density.set_ylabel('Density')
plt.tight_layout()
plt.show() # Display the plot
# 3. Get descriptive statistics as a Pandas DataFrame
desc_df = describe_pd(df.select('value'))
print("\nDescriptive Statistics (Pandas DataFrame):")
print(desc_df)
# Stop the SparkSession
spark.stop()