Snowpark Connect Dependencies Part 2
The `snowpark-connect-deps-2` package provides supporting JAR dependencies for Snowflake's Snowpark Connect for Spark. Snowpark Connect lets developers run Apache Spark workloads directly on Snowflake's high-performance compute engine using familiar Spark DataFrame APIs, without the overhead of managing a dedicated Spark cluster. Together with `snowpark-connect-deps-1`, this package underpins the user-facing `snowpark-connect` library, which is part of the broader Snowpark for Python ecosystem. It is currently at version 3.56.4 and follows a rapid release cadence in step with `snowpark-connect`.
Common errors
- AttributeError: module 'snowflake.snowpark' has no attribute '_internal'
  Cause: This error (or the related `DeprecationWarning: pkg_resources is deprecated`) usually indicates a package-resolution problem, such as an outdated `setuptools` or a conflicting environment.
  Fix: Update `setuptools` (`pip install --upgrade setuptools`). If the issue persists, recreate your virtual environment and reinstall `snowflake-snowpark-python` and `snowpark-connect`.
- java.lang.RuntimeException: [FATAL] No JVM found.
  Cause: Snowpark Connect for Spark relies on Java; this error means a Java Virtual Machine (JVM) could not be located or initialized in the environment.
  Fix: Install a supported JDK (e.g., OpenJDK 11 or 17) and set the `JAVA_HOME` environment variable to the root directory of your JDK installation, for example `export JAVA_HOME=/path/to/jdk-17`.
- org.apache.spark.sql.AnalysisException: Cannot resolve '`your_column`' given input columns
  Cause: A column name used in a Spark DataFrame operation does not exist in the DataFrame's schema, often because of case-sensitivity differences between Spark and Snowflake, or an incorrect transformation.
  Fix: Verify column names and their casing against the actual schema. Snowflake stores unquoted identifiers in uppercase by default, so keep names consistent or use proper quoting for mixed-case identifiers. Review the Spark Connect compatibility guide for semantic differences.
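The case-sensitivity pitfall behind the last error can be illustrated without a cluster. The helper below is purely hypothetical (it is not part of any Snowflake or Spark API); it mimics the rule noted above that Snowflake folds unquoted identifiers to uppercase while double-quoted identifiers keep their exact case:

```python
def resolve_identifier(name: str) -> str:
    """Hypothetical sketch of Snowflake identifier resolution:
    unquoted identifiers fold to uppercase; quoted ones keep their case."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]   # quoted: case preserved exactly
    return name.upper()     # unquoted: folded to uppercase

# A column created with quotes keeps its lowercase name...
stored = resolve_identifier('"your_column"')    # 'your_column'

# ...but a later unquoted reference folds to uppercase and no longer matches.
looked_up = resolve_identifier('your_column')   # 'YOUR_COLUMN'

print(stored, looked_up, stored == looked_up)
```

This is why a DataFrame column that looks correct in your code can still fail to resolve: the stored and referenced names differ only in quoting and case.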
Warnings
- gotcha: The `snowpark-connect-deps-2` package is a low-level dependency of `snowpark-connect`. You are not expected to import or interact with it directly; issues usually stem from environment setup or from `snowpark-connect` itself.
- gotcha: Snowpark Connect for Spark requires a correctly configured Java Development Kit (JDK) in your environment, typically Java 11 or 17. Without one, `snowpark-connect` may fail to initialize or run properly.
- breaking: Snowpark Connect for Spark implicitly converts certain Spark integral data types (`ByteType`, `ShortType`, `IntegerType`) to `LongType` when operating on data. This can lead to unexpected type changes.
- gotcha: Snowpark Connect for Spark has limitations around User-Defined Functions (UDFs) in lambda expressions: UDFs are generally not supported inside lambdas, including some built-in functions implemented as Snowflake UDFs.
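The integral-type widening in the breaking note can be sketched locally. The mapping below illustrates the behavior described above; it is an assumption for demonstration, not the library's actual implementation:

```python
# Assumption: models the documented widening of narrow Spark integral
# types to LongType under Snowpark Connect; other types pass through.
WIDENED_TO_LONG = {"ByteType", "ShortType", "IntegerType"}

def effective_type(spark_type: str) -> str:
    """Return the type a column is effectively handled as (illustrative)."""
    return "LongType" if spark_type in WIDENED_TO_LONG else spark_type

for t in ["ByteType", "ShortType", "IntegerType", "LongType", "DoubleType"]:
    print(f"{t} -> {effective_type(t)}")
```

In practice this means downstream code that expects an exact `IntegerType` schema may see `LongType` instead and may need an explicit cast.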
Install
pip install snowpark-connect-deps-2

Because it is a low-level dependency, this package is normally installed automatically when you install `snowpark-connect`.
Quickstart
import os
from snowflake import snowpark_connect
# Set environment variable to enable Spark Connect mode
os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
# Configure connection parameters (replace with your Snowflake details)
# It's recommended to use environment variables or a configuration file
# for sensitive information like passwords.
os.environ["SNOWFLAKE_ACCOUNT"] = os.environ.get("SNOWFLAKE_ACCOUNT", "your_account_identifier")
os.environ["SNOWFLAKE_USER"] = os.environ.get("SNOWFLAKE_USER", "your_username")
os.environ["SNOWFLAKE_PASSWORD"] = os.environ.get("SNOWFLAKE_PASSWORD", "your_password")
os.environ["SNOWFLAKE_ROLE"] = os.environ.get("SNOWFLAKE_ROLE", "your_role")
os.environ["SNOWFLAKE_WAREHOUSE"] = os.environ.get("SNOWFLAKE_WAREHOUSE", "your_warehouse")
os.environ["SNOWFLAKE_DATABASE"] = os.environ.get("SNOWFLAKE_DATABASE", "your_database")
os.environ["SNOWFLAKE_SCHEMA"] = os.environ.get("SNOWFLAKE_SCHEMA", "your_schema")
# Start the Spark Connect session
snowpark_connect.start_session()
spark = snowpark_connect.get_session()
# Example: Create a DataFrame and show data
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "ID"])
df.show()
# Stop the Spark session when done
spark.stop()
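The quickstart's comments recommend keeping credentials out of code, in environment variables or a configuration file. A minimal, hypothetical sketch of the configuration-file option using Python's standard `configparser` (the file contents, section name, and `SNOWFLAKE_*` mapping are assumptions for illustration):

```python
import configparser
import os

# Hypothetical connection settings; in practice these would live in a file
# on disk (e.g., a connection.ini) with restricted permissions.
SAMPLE_INI = """
[snowflake]
account = your_account_identifier
user = your_username
warehouse = your_warehouse
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)

# Export each key as the SNOWFLAKE_* variable the quickstart expects,
# without overwriting values already present in the environment.
for key, value in parser["snowflake"].items():
    os.environ.setdefault(f"SNOWFLAKE_{key.upper()}", value)

print(sorted(k for k in os.environ if k.startswith("SNOWFLAKE_")))
```

Loading the file before calling `snowpark_connect.start_session()` keeps secrets out of your source while preserving the environment-variable flow shown above.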