Koheesio

0.10.6 · active · verified Sun Apr 12

Koheesio is a unified, composable, and scalable steps-based framework for data processing and ETL tasks built on top of Apache Spark. It simplifies the creation and orchestration of data pipelines by providing a structured way to define and execute steps. The current version is 0.10.6, and it maintains an active release cadence with frequent updates and bug fixes.
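The core idea behind the steps abstraction can be sketched in plain Python: a step declares its inputs as validated fields and exposes its results through an output object. This is a conceptual illustration only, using standard-library dataclasses; the class and field names here are illustrative, not koheesio's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Output:
    """Container a step writes its results into (illustrative only)."""
    rows: list = field(default_factory=list)

@dataclass
class UppercaseStep:
    """Toy step: inputs are declared fields; execute() fills self.output."""
    column: str
    output: Output = field(default_factory=Output)

    def execute(self, rows: list) -> Output:
        # Upper-case the configured column in every row
        self.output.rows = [
            {**row, self.column: row[self.column].upper()} for row in rows
        ]
        return self.output

result = UppercaseStep(column="name").execute([{"name": "ada"}, {"name": "bob"}])
print(result.rows)  # [{'name': 'ADA'}, {'name': 'BOB'}]
```

Because inputs and outputs are declared rather than passed loosely, steps written this way compose into pipelines that are easy to test and orchestrate, which is the pattern koheesio builds on.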

Warnings

Install
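This section is empty in this capture. Koheesio is published on PyPI, so a typical install looks like the following (the `pyspark` extra name is an assumption; check the project's documentation for the extras it currently publishes):

```shell
# Base install from PyPI
pip install koheesio

# With Spark support (extra name assumed)
pip install "koheesio[pyspark]"
```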

Imports

Quickstart

This example starts a Spark session, defines a custom step whose parameters are validated by Pydantic, creates a small DataFrame, and runs the step against it. A Spark environment or a local PySpark installation is required.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

from koheesio.spark.transformations import Transformation

# 1. Start (or reuse) a Spark session
spark = SparkSession.builder.appName("koheesio-example").getOrCreate()

# 2. Define a simple transformation step.
#    Inputs are Pydantic fields; execute() writes to self.output.
class AddColumnStep(Transformation):
    """Adds a new column with a constant value to the DataFrame."""
    value: str

    def execute(self) -> None:
        self.log.info(f"Adding column 'new_column' with value '{self.value}'")
        self.output.df = self.df.withColumn("new_column", lit(self.value))

# 3. Create a dummy DataFrame
df_input = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "col1"])
df_input.show()

# 4. Run the step; transform() executes it and returns the resulting DataFrame
df_output = AddColumnStep(value="koheesio-rocks").transform(df_input)
df_output.show()

# 5. Stop the Spark session (optional in many environments)
# spark.stop()
