tinsel: PySpark schema generator
Tinsel is a lightweight Python library that simplifies PySpark DataFrame schema generation. It lets users define complex PySpark schemas with familiar Python constructs such as `NamedTuple` and dataclasses, removing the need for verbose PySpark schema DSLs. The library is small and fast, and it provides type shims for Spark types that have no direct Python equivalent. The current version is 0.3.0; the last public update was in September 2018, so regular new releases should not be expected.
Common errors
- ModuleNotFoundError: No module named 'tinsel'
  cause: The `tinsel` library is not installed in the current Python environment.
  fix: Run `pip install tinsel` to install the library.
- TypeError: 'StructType' object is not callable
  cause: Attempting to call the result of `transform(YourClass)` as if it were a function, or misusing the generated schema object.
  fix: Assign the output of `transform()` to a variable (e.g., `schema = transform(YourClass)`) and pass it to PySpark's `createDataFrame` using the `schema=` keyword argument (e.g., `spark.createDataFrame(data, schema=schema)`).
- AttributeError: 'module' object has no attribute 'struct' or 'transform'
  cause: This usually means `struct` or `transform` was imported incorrectly, or the `tinsel` package is not properly installed or is shadowed by another module.
  fix: Verify that `from tinsel import struct, transform` is used. Check your Python environment for conflicting packages or local files named `tinsel`, and for a broken installation.
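The first and third errors above both come down to which `tinsel` module, if any, the interpreter resolves. A minimal diagnostic sketch using only the standard library is shown here; the helper name `locate_module` is hypothetical and is not part of tinsel.

```python
import importlib.util
from typing import Optional


def locate_module(name: str) -> Optional[str]:
    """Return where a top-level module would be loaded from, or None if absent.

    Hypothetical helper for illustration; it is not part of tinsel.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file; fall back to their search paths.
    return spec.origin or str(list(spec.submodule_search_locations or []))


location = locate_module("tinsel")
if location is None:
    print("tinsel is not installed in this interpreter; run: pip install tinsel")
else:
    # A path outside site-packages (for example a local tinsel.py) suggests shadowing.
    print(f"tinsel resolves to: {location}")
```

If the printed path points at a file in the project directory rather than the installed package, renaming that file is usually the simplest way to remove the conflict.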
Warnings
- gotcha The `tinsel` library has not been updated since September 2018. Its core functionality remains valid, but it may not be compatible with recent PySpark releases or newer Python language features.
- gotcha Tinsel handles nullable fields and provides 'type shims' for Spark types that have no direct Python equivalent (e.g., `long` or `short`). Check how each annotation is mapped, since a plain Python type may not produce the Spark type you expect.
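To see which fields of a class are declared optional before generating a schema, the annotations can be inspected with the standard `typing` module. The sketch below assumes that `Optional[...]` is what marks a field as nullable, as the Quickstart's use of it suggests. It does not reproduce tinsel's implementation, and the names `is_optional` and `nullable_report` are hypothetical.

```python
from typing import NamedTuple, Optional, Union, get_args, get_origin, get_type_hints


def is_optional(annotation) -> bool:
    """Return True when the annotation is typing.Optional[X], i.e. Union[X, None]."""
    return get_origin(annotation) is Union and type(None) in get_args(annotation)


def nullable_report(cls) -> dict:
    """Map each annotated field name to whether it is declared Optional."""
    return {name: is_optional(tp) for name, tp in get_type_hints(cls).items()}


class Account(NamedTuple):
    # Illustrative class only; it mirrors the style of the Quickstart.
    login: str
    age: int
    last_seen: Optional[int]


print(nullable_report(Account))
# → {'login': False, 'age': False, 'last_seen': True}
```

Comparing this report with the output of `df.printSchema()` is one way to confirm that the generated schema marks the intended fields as nullable.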
Install
- pip install tinsel
Imports
- struct
  preferred: from tinsel import struct
  avoid: from tinsel.schema import struct
- transform
  preferred: from tinsel import transform
  avoid: import tinsel.transform
Quickstart
from dataclasses import dataclass
from typing import NamedTuple, Optional, Dict, List

from tinsel import struct, transform
from pyspark.sql import SparkSession


# Define nested schema using dataclass
@struct
@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


# Define root schema using NamedTuple
@struct
class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]


# Transform the Python class into a PySpark schema
spark_schema = transform(User)

# Prepare sample data matching the defined structure
data = [
    User(
        login="Ben",
        age=18,
        active=False,
        info=None
    ),
    User(
        login="Tom",
        age=32,
        active=True,
        info=UserInfo(
            hobby=["pets", "flowers"],
            last_seen=16,
            pet_ages={
                "Jack": 2,
                "Sunshine": 6
            }
        )
    )
]

# Initialize SparkSession
spark = SparkSession.builder.master('local').appName("TinselQuickstart").getOrCreate()

# Create DataFrame using the generated schema and data
df = spark.createDataFrame(data=data, schema=spark_schema)
df.printSchema()
df.show(truncate=False)

spark.stop()