tinsel: PySpark schema generator

0.3.0 · maintenance · verified Thu Apr 16

Tinsel is a lightweight Python library that simplifies PySpark DataFrame schema generation. It lets users define complex PySpark schemas with familiar native Python types such as `NamedTuple` and `dataclasses`, removing the need for verbose hand-written `StructType`/`StructField` declarations. The library is small and fast, and it provides type shims for some Python types that lack direct Spark equivalents. The current version is 0.3.0; the last public update was in September 2018, indicating a maintenance-level release cadence.
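To illustrate the general idea behind mapping Python annotations to a Spark-style schema, here is a minimal toy sketch. This is not tinsel's actual implementation; the names `field_spec` and `toy_schema` are invented for illustration, and real Spark schemas use `StructType`/`StructField` objects rather than dicts.

```python
from typing import NamedTuple, Optional, Union, get_args, get_origin, get_type_hints

# Toy mapping from Python types to Spark SQL type names (illustrative only).
_PRIMITIVES = {str: "string", int: "long", bool: "boolean", float: "double"}

def field_spec(name, tp):
    """Describe one annotated attribute, unwrapping Optional[...] into nullability."""
    nullable = False
    if get_origin(tp) is Union:
        # Optional[X] is Union[X, None]: strip NoneType and mark the field nullable
        args = [a for a in get_args(tp) if a is not type(None)]
        tp, nullable = args[0], True
    return {"name": name, "type": _PRIMITIVES.get(tp, "complex"), "nullable": nullable}

def toy_schema(cls):
    """Walk a class's type annotations and build a flat schema description."""
    return [field_spec(n, t) for n, t in get_type_hints(cls).items()]

class User(NamedTuple):
    login: str
    age: int
    last_seen: Optional[int]

print(toy_schema(User))
# [{'name': 'login', 'type': 'string', 'nullable': False},
#  {'name': 'age', 'type': 'long', 'nullable': False},
#  {'name': 'last_seen', 'type': 'long', 'nullable': True}]
```

A real implementation would additionally recurse into nested structs and handle containers like `List` and `Dict`, which is what tinsel's `transform` does for the classes decorated with `@struct` below.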

Install
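The Install section of the source page was empty. Assuming the package is published on PyPI under the same name as its import (`tinsel`), installation with pip would look like:

```shell
# Assumption: the PyPI package name matches the import name used in the quickstart.
pip install tinsel
# PySpark is required to actually use the generated schemas:
pip install pyspark
```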

Quickstart

This quickstart demonstrates how to define a PySpark schema using Tinsel with Python's `dataclasses` and `NamedTuple`. It then converts this definition into a `StructType` compatible with PySpark and creates a DataFrame with sample data.

from dataclasses import dataclass
from typing import NamedTuple, Optional, Dict, List
from tinsel import struct, transform
from pyspark.sql import SparkSession

# Define nested schema using dataclass
@struct
@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]

# Define root schema using NamedTuple
@struct
class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]

# Transform the Python class into a PySpark schema
spark_schema = transform(User)

# Prepare sample data matching the defined structure
data = [
    User(
        login="Ben",
        age=18,
        active=False,
        info=None
    ),
    User(
        login="Tom",
        age=32,
        active=True,
        info=UserInfo(
            hobby=["pets", "flowers"],
            last_seen=16,
            pet_ages={
                "Jack": 2,
                "Sunshine": 6
            }
        )
    )
]

# Initialize SparkSession
spark = SparkSession.builder.master('local').appName("TinselQuickstart").getOrCreate()

# Create DataFrame using the generated schema and data
df = spark.createDataFrame(data=data, schema=spark_schema)
df.printSchema()
df.show(truncate=False)

spark.stop()
