{"id":4600,"library":"koheesio","title":"Koheesio","description":"Koheesio is a unified, composable, and scalable steps-based framework for data processing and ETL tasks built on top of Apache Spark. It simplifies the creation and orchestration of data pipelines by providing a structured way to define and execute steps. The current version is 0.10.6, and it maintains an active release cadence with frequent updates and bug fixes.","status":"active","version":"0.10.6","language":"en","source_language":"en","source_url":"https://github.com/Nike-Inc/koheesio","tags":["data-processing","spark","etl","pipeline","dataframe","snowflake","databricks","pydantic"],"install":[{"cmd":"pip install koheesio","lang":"bash","label":"Basic Installation"},{"cmd":"pip install 'koheesio[snowflake]' # for Snowflake features","lang":"bash","label":"With Snowflake Support"},{"cmd":"pip install 'koheesio[all]' # for all optional dependencies","lang":"bash","label":"With All Features"}],"dependencies":[{"reason":"Core dependency for all Spark-based operations. Requires version >=3.3.0,<3.6.0.","package":"pyspark","optional":false},{"reason":"Used for defining step parameters and validation. 
Requires version >=2.0.0.","package":"pydantic","optional":false},{"reason":"Required for Snowflake integration features.","package":"snowflake-connector-python","optional":true},{"reason":"Required for Delta Lake integration features.","package":"delta-spark","optional":true},{"reason":"Required for Databricks SQL connectivity.","package":"databricks-sql-connector","optional":true}],"imports":[{"note":"Use this to get or create the managed Spark session.","symbol":"KoheesioSparkSession","correct":"from koheesio.spark import KoheesioSparkSession"},{"note":"Base class for defining custom data processing steps.","symbol":"Step","correct":"from koheesio.steps import Step"},{"note":"Base class for orchestrating multiple steps.","symbol":"Pipeline","correct":"from koheesio.pipelines import Pipeline"},{"note":"Example of a common built-in reader step.","symbol":"CsvReader","correct":"from koheesio.steps.readers import CsvReader"}],"quickstart":{"code":"from koheesio.spark import KoheesioSparkSession\nfrom koheesio.steps import Step\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import lit\n\n# 1. Get or create the Koheesio Spark session\nspark_session_name = \"koheesio-example\"\nspark = KoheesioSparkSession.get_or_create(spark_session_name=spark_session_name)\n\n# 2. Define a simple Step\nclass AddColumnStep(Step):\n    \"\"\"Adds a new column to the DataFrame.\"\"\"\n    value: str\n\n    def execute(self, df: DataFrame) -> DataFrame:\n        self.log.info(f\"Adding column 'new_column' with value '{self.value}'\")\n        return df.withColumn(\"new_column\", lit(self.value))\n\n# 3. Create a dummy DataFrame\ndf_input = spark.createDataFrame([(1, \"a\"), (2, \"b\")], [\"id\", \"col1\"])\ndf_input.show()\n\n# 4. Run your step\ndf_output = AddColumnStep(value=\"koheesio-rocks\").execute(df_input)\ndf_output.show()\n\n# 5. 
Stop the Spark session (optional in many environments)\n# spark.stop()\n","lang":"python","description":"This example demonstrates how to initialize a Koheesio Spark session, define a custom `Step` using Pydantic for parameters, create a dummy DataFrame, and execute the step. Ensure a Spark environment is available or PySpark is installed."},"warnings":[{"fix":"Ensure your environment's `pyspark` version is within the specified range. For example, `pip install pyspark==3.5.0`.","message":"Koheesio has strict `pyspark` version requirements (`>=3.3.0,<3.6.0`). Using Spark versions outside this range, especially attempting 'Spark Connect' with unsupported versions, will lead to errors.","severity":"breaking","affected_versions":"All versions 0.10.x"},{"fix":"Install Koheesio with the required extras, for example: `pip install 'koheesio[snowflake]'` or `pip install 'koheesio[all]'`.","message":"Specific functionalities like Snowflake, Databricks, or Delta Lake integration require installing Koheesio with their respective optional dependencies (e.g., `pip install 'koheesio[snowflake]'`).","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade or downgrade your Python interpreter to a compatible version (e.g., Python 3.9, 3.10, 3.11, or 3.12).","message":"Koheesio requires Python versions `>=3.9` and `<3.13`. Ensure your Python environment meets these requirements to avoid compatibility issues.","severity":"gotcha","affected_versions":"All versions 0.10.x"},{"fix":"Review Pydantic V2 migration guides. Update `Step` parameter definitions and ensure correct Pydantic V2 syntax is used.","message":"Koheesio now explicitly requires Pydantic V2 (`>=2.0.0`). 
If you are migrating from an older Koheesio version that used Pydantic V1, you may need to update your custom `Step` definitions due to Pydantic's breaking API changes between major versions.","severity":"breaking","affected_versions":"Likely from 0.x.x to 0.10.x if Pydantic V1 was previously used."}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}