{"library":"pyspark-dist-explore","title":"PySpark Distribution Explorer","description":"PySpark Distribution Explorer (pyspark-dist-explore, current version 0.1.8) is a Python library for creating histogram and density plots directly from PySpark DataFrames. It simplifies exploratory data analysis (EDA) on large datasets: the histogram bins are computed in Spark, and only the aggregated counts are passed to Matplotlib and Pandas for plotting. The project is in maintenance mode with infrequent updates.","language":"python","status":"maintenance","last_verified":"Fri Apr 17","install":{"commands":["pip install pyspark-dist-explore"],"cli":null},"imports":["from pyspark_dist_explore import hist","from pyspark_dist_explore import distplot","from pyspark_dist_explore import pandas_histogram"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from pyspark_dist_explore import hist, distplot, pandas_histogram\nfrom pyspark.sql import SparkSession\nimport matplotlib.pyplot as plt\n\n# Ensure a SparkSession is available (replace with your actual Spark setup)\n# For local testing, install pyspark first: pip install pyspark\nspark = SparkSession.builder.appName(\"DistExploreQuickstart\").getOrCreate()\n\n# Create a sample PySpark DataFrame\ndata = [\n    (1, \"A\", 10.5),\n    (2, \"B\", 12.0),\n    (3, \"A\", 11.2),\n    (4, \"C\", 9.8),\n    (5, \"B\", 13.1),\n    (6, \"A\", 10.8),\n    (7, \"C\", 9.5),\n    (8, \"B\", 12.5),\n    (9, \"A\", 11.0),\n    (10, \"C\", 10.0)\n]\ncolumns = [\"id\", \"category\", \"value\"]\ndf = spark.createDataFrame(data, columns)\n\nprint(\"Original DataFrame:\")\ndf.show()\n\n# 1. Generate a histogram (bin counts are computed in Spark)\nfig_hist, ax_hist = plt.subplots()\nhist(ax_hist, df.select('value'), bins=5, color='skyblue', edgecolor='black')\nax_hist.set_title('Histogram of Value')\nax_hist.set_xlabel('Value')\nax_hist.set_ylabel('Frequency')\nplt.tight_layout()\nplt.show()  # Display the plot\n\n# 2. Generate a normalized histogram overlaid with a density curve\nfig_density, ax_density = plt.subplots()\ndistplot(ax_density, df.select('value'), bins=5)\nax_density.set_title('Density Plot of Value')\nax_density.set_xlabel('Value')\nax_density.set_ylabel('Density')\nplt.tight_layout()\nplt.show()  # Display the plot\n\n# 3. Get the binned histogram counts as a Pandas DataFrame\nhist_df = pandas_histogram(df.select('value'), bins=5)\nprint(\"\\nHistogram counts (Pandas DataFrame):\")\nprint(hist_df)\n\n# Stop the SparkSession\nspark.stop()\n","lang":"python","description":"This quickstart initializes a SparkSession, creates a sample DataFrame, and then uses `hist`, `distplot`, and `pandas_histogram` to visualize and tabulate the distribution of a numeric column. Select only numeric columns before passing a DataFrame to these functions, and call `plt.show()` to display the plots.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}