{"library":"repartipy","title":"repartipy: PySpark DataFrame Partition Size Helper","description":"repartipy is a Python library that helps manage PySpark DataFrame partition sizes. It provides a function to repartition a DataFrame based on a target partition size in megabytes, aiming to optimize storage and processing efficiency. As of version 0.1.8, it is a small, focused utility with no fixed release cadence; updates are likely driven by PySpark compatibility or feature requests.","language":"python","status":"active","last_verified":"Fri Apr 17","install":{"commands":["pip install repartipy"],"cli":null},"imports":["from repartipy import repartition_by_size"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import lit\nfrom repartipy import repartition_by_size\n\n# Create a SparkSession\nspark = SparkSession.builder \\\n    .appName(\"repartipy_quickstart\") \\\n    .master(\"local[*]\") \\\n    .config(\"spark.ui.enabled\", \"false\") \\\n    .getOrCreate()\n\ntry:\n    # Create a dummy DataFrame for demonstration: 100,000 rows, each with a\n    # 500-character string, i.e. roughly 50 MB of raw string data\n    # (adjust range and string size to control actual data size)\n    data = [(i, f\"value_{i}\") for i in range(100000)]\n    df = spark.createDataFrame(data, [\"id\", \"value\"])\n    # Add a large column to increase row size for a more realistic scenario\n    df = df.withColumn(\"large_string\", lit(\"x\" * 500))\n\n    print(f\"Initial DataFrame row count: {df.count()}\")\n    print(f\"Initial partition count: {df.rdd.getNumPartitions()}\")\n\n    # Repartition the DataFrame to aim for 10MB partitions\n    target_size_mb = 10\n    repartitioned_df = repartition_by_size(df, target_size_mb, spark=spark)\n\n    print(f\"Repartitioned DataFrame row count: {repartitioned_df.count()}\")\n    print(f\"New partition count (target ~{target_size_mb}MB per partition): {repartitioned_df.rdd.getNumPartitions()}\")\n\n    # Optionally write the result so the new partitioning is reflected in the output files\n    # e.g., repartitioned_df.write.mode(\"overwrite\").parquet(\"/tmp/repartipy_output\")\n\nfinally:\n    spark.stop()","lang":"python","description":"This quickstart initializes a SparkSession, creates a sample DataFrame, and uses `repartition_by_size` to repartition it toward 10MB partitions, printing the partition count before and after. Repartitioning returns a new DataFrame, and the work is performed only when an action (such as `count()` or a write) runs.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}