Snowpark Connect
Snowpark Connect (current version 1.21.1) lets developers run Snowpark Python code locally on a local Spark cluster, emulating Snowpark functionality without requiring a direct Snowflake connection. This enables offline development, testing, and CI/CD pipelines. Releases typically track Snowpark Python and the underlying Spark/Snowflake connector versions, and the project is actively maintained by Snowflake Labs.
Common errors
- Error: JAVA_HOME is not set
  Cause: The `JAVA_HOME` environment variable, which points to your Java installation, is not configured.
  Fix: Set `JAVA_HOME` to the path of your Java JDK/JRE installation (e.g., `/usr/lib/jvm/java-11-openjdk-amd64`). Also ensure Java is in your system's PATH.
- java.io.IOException: Cannot run program "java": error=2, No such file or directory
  Cause: The `java` executable is not found in your system's PATH, or `JAVA_HOME` is incorrectly set.
  Fix: Verify that Java is installed, that its `bin` directory is on your PATH, and that `JAVA_HOME` points to the Java installation directory whose `bin` subdirectory contains `java`.
- org.apache.spark.SparkException: Cannot find any SparkSession in the current JVM.
  Cause: The Spark session was not properly initialized, was stopped before being used, or the underlying Spark environment is misconfigured.
  Fix: Ensure `connect_with_spark_session_builder()` and `getOrCreateSnowparkSession()` are called successfully before any Snowpark DataFrame operations. Check the Spark configuration, especially `spark.jars.packages`, for correctness.
- ModuleNotFoundError: No module named 'pyspark'
  Cause: Although `pyspark` may be installed, Python cannot find it, often due to environment path issues or `findspark` not being initialized.
  Fix: Add `import findspark; findspark.init()` at the beginning of your script, and confirm `pyspark` is actually installed with `pip show pyspark`.
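Most of the errors above stem from a misconfigured Java or Python environment. A minimal preflight check can catch them before a session is ever built; this is a sketch using only the standard library (the function name `preflight_check` is illustrative, not part of Snowpark Connect):

```python
import os
import shutil

def preflight_check():
    """Return a list of environment problems likely to break Snowpark Connect."""
    problems = []
    # JAVA_HOME must point at an existing Java installation directory.
    java_home = os.environ.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME points to a missing directory: {java_home}")
    # The `java` executable must be discoverable on PATH.
    if shutil.which("java") is None:
        problems.append("'java' executable not found on PATH")
    # pyspark must be importable; findspark can fix path issues if installed.
    try:
        import pyspark  # noqa: F401
    except ImportError:
        problems.append("pyspark is not importable (try findspark.init())")
    return problems

if __name__ == "__main__":
    for problem in preflight_check():
        print("WARNING:", problem)
```

Running this before the Quickstart turns cryptic JVM-level failures into plain warnings.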
Warnings
- breaking Snowpark Connect has specific Python version requirements (currently >=3.10, <3.13). Using incompatible Python versions can lead to installation failures or runtime errors.
- gotcha A `java.io.IOException` or similar error indicating 'Cannot run program "java"' often means the `JAVA_HOME` environment variable is not set correctly, or Java is not installed or discoverable in your system's PATH.
- gotcha Incorrect or outdated `spark.jars.packages` values in the `config` dictionary can lead to runtime errors when Snowpark Connect tries to load Spark-Snowflake connector JARs, preventing proper emulation.
- gotcha It's common to confuse imports: `snowpark_connect` provides the session *builder*, but core Snowpark objects like `Session`, `DataFrame`, and `functions` are imported directly from the `snowpark` library.
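The Python version constraint can be enforced up front rather than surfacing as an obscure installation failure. A small guard whose bounds mirror the requirement stated above (`check_python_version` is a hypothetical helper, not part of the library):

```python
import sys

# Snowpark Connect requires Python >= 3.10 and < 3.13 (per the warning above).
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)

def python_version_supported(version_info=sys.version_info):
    """Return True if (major, minor) falls inside the supported range."""
    return MIN_VERSION <= (version_info[0], version_info[1]) < MAX_VERSION

def check_python_version():
    """Raise early with a clear message instead of failing deep inside pip or Spark."""
    if not python_version_supported():
        raise RuntimeError(
            f"Unsupported Python {sys.version_info.major}.{sys.version_info.minor}; "
            "Snowpark Connect requires >=3.10, <3.13"
        )
```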
Install
- pip install snowpark-connect
Imports
- connect_with_spark_session_builder
from snowpark_connect.session import connect_with_spark_session_builder
- Session
from snowpark.snowpark_session import Session
- DataFrame
from snowpark.dataframe import DataFrame
Quickstart
from snowpark_connect.session import connect_with_spark_session_builder
from snowpark.types import StructType, StructField, StringType, IntegerType

# Create a local Spark session that emulates Snowpark behavior.
# Ensure these JARs are compatible with your Spark and Snowflake versions.
spark_session = connect_with_spark_session_builder(
    app_name="SnowparkConnectLocalApp",
    config={
        "spark.jars.packages": "net.snowflake:snowflake-jdbc:3.13.29,net.snowflake:spark-snowflake_2.12:2.11.0-spark_3.4",
        "spark.jars.repositories": "https://repo1.maven.org/maven2",
    },
)

# Use the Spark session to create a Snowpark session
session = spark_session.getOrCreateSnowparkSession()

# Example: create a Snowpark DataFrame and show its contents
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])
data = [("Alice", 30), ("Bob", 25)]
df = session.create_dataframe(data, schema=schema)
df.show()

# Clean up both sessions when done
session.close()
spark_session.stop()
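The coordinates in `spark.jars.packages` must agree with each other: the Scala and Spark versions are baked into the connector artifact name, so mismatched values fail only at runtime. A small helper to assemble the config dict from explicit version strings can make that coupling visible (`build_spark_config` is a hypothetical helper; the versions shown are the ones from the Quickstart, not a recommendation):

```python
def build_spark_config(jdbc_version, connector_version, scala_version, spark_version):
    """Assemble the config dict passed to connect_with_spark_session_builder.

    The Scala and Spark versions appear inside the spark-snowflake artifact
    name, so keeping them as named parameters avoids silent mismatches.
    """
    packages = ",".join([
        f"net.snowflake:snowflake-jdbc:{jdbc_version}",
        f"net.snowflake:spark-snowflake_{scala_version}:{connector_version}-spark_{spark_version}",
    ])
    return {
        "spark.jars.packages": packages,
        "spark.jars.repositories": "https://repo1.maven.org/maven2",
    }

# Reproduces the Quickstart configuration above.
config = build_spark_config("3.13.29", "2.11.0", "2.12", "3.4")
```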