{"id":10135,"library":"pyspark-dist-explore","title":"PySpark Distribution Explorer","description":"PySpark Distribution Explorer (pyspark-dist-explore, current version 0.1.8) is a Python library for creating histogram and density plots directly from PySpark DataFrames. It simplifies exploratory data analysis (EDA) for large datasets by leveraging Matplotlib and Pandas to visualize distributions. The project is currently in maintenance mode with infrequent updates.","status":"maintenance","version":"0.1.8","language":"en","source_language":"en","source_url":"https://github.com/mozilla/pyspark-dist-explore","tags":["pyspark","data-visualization","eda","histogram","density-plot","matplotlib","big-data"],"install":[{"cmd":"pip install pyspark-dist-explore","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core functionality relies on PySpark DataFrames.","package":"pyspark","optional":false},{"reason":"Used for generating histogram and density plots.","package":"matplotlib","optional":false},{"reason":"Used for `describe_pd` functionality and internal data handling for plotting.","package":"pandas","optional":false}],"imports":[{"symbol":"hist","correct":"from pyspark_dist_explore import hist"},{"symbol":"density_plot","correct":"from pyspark_dist_explore import density_plot"},{"symbol":"describe_pd","correct":"from pyspark_dist_explore import describe_pd"}],"quickstart":{"code":"from pyspark_dist_explore import hist, density_plot, describe_pd\nfrom pyspark.sql import SparkSession\nimport matplotlib.pyplot as plt\n\n# Ensure SparkSession is available (replace with your actual Spark setup)\n# For local testing, ensure pyspark is installed: pip install pyspark\nspark = SparkSession.builder.appName(\"DistExploreQuickstart\").getOrCreate()\n\n# Create a sample PySpark DataFrame\ndata = [\n    (1, \"A\", 10.5),\n    (2, \"B\", 12.0),\n    (3, \"A\", 11.2),\n    (4, \"C\", 9.8),\n    (5, \"B\", 13.1),\n    (6, \"A\", 10.8),\n    
(7, \"C\", 9.5),\n    (8, \"B\", 12.5),\n    (9, \"A\", 11.0),\n    (10, \"C\", 10.0)\n]\ncolumns = [\"id\", \"category\", \"value\"]\ndf = spark.createDataFrame(data, columns)\n\nprint(\"Original DataFrame:\")\ndf.show()\n\n# 1. Generate a histogram\nfig_hist, ax_hist = plt.subplots()\nhist(ax_hist, df.select('value'), bins=5, color='skyblue', edgecolor='black')\nax_hist.set_title('Histogram of Value')\nax_hist.set_xlabel('Value')\nax_hist.set_ylabel('Frequency')\nplt.tight_layout()\nplt.show() # Display the plot\n\n# 2. Generate a density plot\nfig_density, ax_density = plt.subplots()\ndensity_plot(ax_density, df.select('value'), color='green', fill=True, alpha=0.5)\nax_density.set_title('Density Plot of Value')\nax_density.set_xlabel('Value')\nax_density.set_ylabel('Density')\nplt.tight_layout()\nplt.show() # Display the plot\n\n# 3. Get descriptive statistics as a Pandas DataFrame\ndesc_df = describe_pd(df.select('value'))\nprint(\"\\nDescriptive Statistics (Pandas DataFrame):\")\nprint(desc_df)\n\n# Stop the SparkSession\nspark.stop()\n","lang":"python","description":"This quickstart demonstrates how to initialize a SparkSession, create a sample DataFrame, and then use `hist`, `density_plot`, and `describe_pd` to visualize and summarize numerical distributions. Remember to call `plt.show()` to display the plots."},"warnings":[{"fix":"Initialize `fig, ax = plt.subplots()` before calling plotting functions like `hist(ax, ...)`.","message":"PySpark-dist-explore functions (`hist`, `density_plot`) require a `matplotlib.axes.Axes` object as their first argument. 
You must create a Matplotlib figure and axes explicitly before calling these functions.","severity":"gotcha","affected_versions":"0.1.x"},{"fix":"Use `df.select('column_name')` to pass only the relevant numerical column to the plotting function, e.g., `hist(ax, df.select('my_numeric_column'))`.","message":"The plotting functions (`hist`, `density_plot`) expect a PySpark DataFrame containing *only* the numerical column(s) you wish to plot. Do not pass the entire DataFrame if it contains non-numerical columns or columns you do not intend to plot.","severity":"gotcha","affected_versions":"0.1.x"},{"fix":"In an interactive session, call `plt.show()` after generating plots. In non-interactive environments, save the figure instead with `plt.savefig('my_plot.png')`; a non-interactive backend such as `agg` can render to a file but cannot open a window. In Jupyter/IPython, enable `%matplotlib inline` or `%matplotlib notebook`.","message":"Plots may not appear in non-interactive environments (e.g., scripts on remote or headless servers) because Matplotlib may fall back to a non-interactive backend there. `plt.show()` opens a window only when an interactive backend is available.","severity":"gotcha","affected_versions":"0.1.x"},{"fix":"Expect possible compatibility issues with recent PySpark versions. For critical new projects, consider a more actively maintained PySpark visualization library if one is available, or compute the distribution with PySpark's RDD/DataFrame operations and plot it with Matplotlib/Seaborn.","message":"The library is in maintenance mode with its last release in 2019 (0.1.8). While functional, it might not receive updates for newer PySpark versions or advanced features, and bug fixes are unlikely.","severity":"deprecated","affected_versions":"<=0.1.8"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"This warning means `plt.show()` cannot open a window under the current backend. 
If running in a notebook (like Jupyter), use `%matplotlib inline` or `%matplotlib notebook`. Otherwise, save the figure with `plt.savefig('plot.png')`.","cause":"Matplotlib is using a non-interactive backend (like 'agg'), which cannot display plots on screen. The warning is emitted when `plt.show()` is called under such a backend.","error":"UserWarning: Matplotlib is currently using agg, which is a non-interactive backend, so figures will not be shown."},{"fix":"Ensure the column you are plotting is of a numeric type (IntegerType, FloatType, DoubleType). Cast the column if necessary: `df.withColumn('numeric_col', df['string_col'].cast('double')).select('numeric_col')`.","cause":"You passed a PySpark DataFrame column with a non-numeric data type (e.g., StringType) to a plotting function that expects numerical data.","error":"TypeError: cannot convert 'StringType' object to float"},{"fix":"PySpark-dist-explore functions are standalone. Instead of `df.plot()`, use `hist(ax, df.select('column_name'))` or `density_plot(ax, df.select('column_name'))` after initializing `fig, ax = plt.subplots()`.","cause":"You are trying to call a `.plot()` method directly on a PySpark DataFrame, which is not supported by PySpark itself.","error":"AttributeError: 'DataFrame' object has no attribute 'plot'"},{"fix":"Make sure to initialize your SparkSession: `from pyspark.sql import SparkSession; spark = SparkSession.builder.appName(\"MyApp\").getOrCreate()`.","cause":"The `SparkSession` object named `spark` was not created or is out of scope before being used.","error":"NameError: name 'spark' is not defined"}]}