Quinn PySpark Utilities
Quinn is a Python library of helper methods for PySpark that boosts developer productivity. It offers DataFrame validation functions, reusable column functions and DataFrame transformations, and other performance-minded helpers. The library is currently at version 0.10.3 and is actively maintained with regular releases.
Warnings
- breaking Version 0.2.0 introduced significant breaking changes to the directory structure and import interfaces for PySpark extensions and functions.
- deprecated The `print_athena_create_table` functionality has been deprecated.
- gotcha Wildcard imports (`from quinn import *`, or `from quinn.extensions import *` when only a few functions are needed) make it hard to trace where a function comes from and can lead to name collisions; prefer `import quinn` or explicit named imports.
- gotcha PySpark operations (including those using `quinn`) are lazily evaluated. Transformations build a logical plan and are only executed when an action (e.g., `show()`, `collect()`, `write()`) is called. This is a common pitfall for Python developers used to eager execution; see the sketch after this list.
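A minimal sketch of the lazy-evaluation gotcha above (the DataFrame and column names here are illustrative only, not part of quinn's API):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(5)
# No Spark job runs here: withColumn only adds a step to the logical plan.
doubled = df.withColumn("doubled", F.col("id") * 2)
# Execution happens only when an action such as show() is called.
doubled.show()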
Install
- pip
pip install quinn
Imports
- quinn
import quinn
- extensions
from quinn.extensions import *
- F
from pyspark.sql import functions as F
Quickstart
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType
import quinn
from quinn.extensions import *
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("QuinnQuickstart") \
    .master("local[*]") \
    .getOrCreate()
# NOTE: `spark.create_df` and Column methods like `isTruthy()` are automatically
# available after `from quinn.extensions import *`
# Create a DataFrame using quinn's extended create_df method
data = [
    ("Alice", 1, "USA"),
    ("Bob", 2, "Canada"),
    ("Charlie", 3, "Mexico")
]
schema_def = [
    ("firstName", StringType(), True),
    ("id", IntegerType(), True),
    ("country", StringType(), True)
]
df = spark.create_df(data, schema_def)
print("Original DataFrame Schema:")
df.printSchema()
print("Original DataFrame Data:")
df.show()
# Apply a quinn DataFrame transformation: snake_case_col_names
snake_cased_df = quinn.snake_case_col_names(df)
print("\nDataFrame with snake_cased columns:")
snake_cased_df.printSchema()
snake_cased_df.show()
# Demonstrate a Column extension (e.g., isTruthy from quinn.extensions)
from pyspark.sql import functions as F
extended_df = df.withColumn("is_id_truthy", F.col("id").isTruthy())
print("\nDataFrame with 'is_id_truthy' column (using quinn extension):")
extended_df.show()
# Stop SparkSession
spark.stop()
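The intro above also mentions quinn's DataFrame validation functions. Below is a minimal, hedged continuation of the quickstart (run it before the `spark.stop()` call); `validate_presence_of_columns` is one of quinn's documented validators, but verify the exact function and exception names against your installed version.
# Validation that passes: every required column exists in df.
quinn.validate_presence_of_columns(df, ["firstName", "id", "country"])
# Validation that fails: "age" is not a column of df, so quinn raises an
# exception (documented as DataFrameMissingColumnError; verify for your version).
try:
    quinn.validate_presence_of_columns(df, ["firstName", "age"])
except Exception as e:
    print(f"\nValidation failed as expected: {e}")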