PySpark Distribution Explorer

0.1.8 · maintenance · verified Fri Apr 17

PySpark Distribution Explorer (pyspark-dist-explore, current version 0.1.8) is a Python library for creating histogram and density plots directly from PySpark DataFrames. It simplifies exploratory data analysis (EDA) on large datasets by using Matplotlib and Pandas to visualize distributions. The project is in maintenance mode and receives infrequent updates.

Common errors

Warnings

Install

pip install pyspark-dist-explore

Imports

Quickstart

This quickstart initializes a SparkSession, builds a small sample DataFrame, and uses `hist`, `distplot`, and `pandas_histogram` to plot and tabulate the distribution of a numeric column. Call `plt.show()` to display each plot.

from pyspark_dist_explore import hist, distplot, pandas_histogram
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

# Ensure SparkSession is available (replace with your actual Spark setup)
# For local testing, ensure pyspark is installed: pip install pyspark
spark = SparkSession.builder.appName("DistExploreQuickstart").getOrCreate()

# Create a sample PySpark DataFrame
data = [
    (1, "A", 10.5),
    (2, "B", 12.0),
    (3, "A", 11.2),
    (4, "C", 9.8),
    (5, "B", 13.1),
    (6, "A", 10.8),
    (7, "C", 9.5),
    (8, "B", 12.5),
    (9, "A", 11.0),
    (10, "C", 10.0)
]
columns = ["id", "category", "value"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

# 1. Generate a histogram (extra keyword arguments are passed to Matplotlib's hist)
fig_hist, ax_hist = plt.subplots()
hist(ax_hist, df.select('value'), bins=5, color='skyblue', edgecolor='black')
ax_hist.set_title('Histogram of Value')
ax_hist.set_xlabel('Value')
ax_hist.set_ylabel('Frequency')
plt.tight_layout()
plt.show() # Display the plot

# 2. Generate a normalized histogram with a density curve overlaid
fig_density, ax_density = plt.subplots()
distplot(ax_density, df.select('value'), color='green', alpha=0.5)
ax_density.set_title('Density Plot of Value')
ax_density.set_xlabel('Value')
ax_density.set_ylabel('Density')
plt.tight_layout()
plt.show() # Display the plot

# 3. Get the binned counts as a Pandas DataFrame
hist_pdf = pandas_histogram(df.select('value'), bins=5)
print("\nHistogram bins and counts (Pandas DataFrame):")
print(hist_pdf)

# Stop the SparkSession
spark.stop()
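
A histogram of a large DataFrame only needs per-bin counts, not the raw rows, which is why plotting can stay cheap on the driver. The sketch below illustrates that general idea in plain Python. It is not the library's implementation: the helper names (`make_edges`, `bin_counts`, `merge_counts`) and the two-partition split are hypothetical, chosen only to show counts being computed per partition and then merged.

```python
from bisect import bisect_right
from typing import List, Sequence


def make_edges(lo: float, hi: float, bins: int) -> List[float]:
    """Return bins + 1 equally spaced edges covering [lo, hi]."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins)] + [hi]


def bin_counts(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the requested range
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


def merge_counts(per_partition: Sequence[Sequence[int]]) -> List[int]:
    """Sum per-partition bin counts element-wise."""
    return [sum(col) for col in zip(*per_partition)]


# The ten 'value' entries from the quickstart, split into two
# hypothetical partitions to mimic distributed data.
partitions = [
    [10.5, 12.0, 11.2, 9.8, 13.1],
    [10.8, 9.5, 12.5, 11.0, 10.0],
]

lo = min(min(p) for p in partitions)
hi = max(max(p) for p in partitions)
edges = make_edges(lo, hi, bins=5)

# Each partition is reduced to five integers before anything is combined.
counts = merge_counts([bin_counts(p, edges) for p in partitions])
print(counts)  # [3, 2, 2, 1, 2]
```

Only `edges` and `counts` would need to reach the plotting code, so the amount of data moved depends on the number of bins rather than the number of rows.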
