Amazon SageMaker Feature Store PySpark Bindings


PySpark bindings for Amazon SageMaker Feature Store, enabling large-scale feature engineering and serving with Spark DataFrames. Current version is 1.2.0; releases ship monthly.

pip install sagemaker-feature-store-pyspark
error ImportError: No module named 'sagemaker_feature_store_pyspark'
cause The PyPI package name differs from the import path; the correct import uses submodules under 'sagemaker'.
fix Use 'from sagemaker.feature_store.feature_store import FeatureStoreManager' instead.
error Py4JJavaError: An error occurred while calling o123.ingest. : org.apache.spark.SparkException: Job aborted due to stage failure
cause Missing Hadoop AWS JARs in Spark classpath.
fix Add the hadoop-aws JAR via spark.jars.packages or --jars.
gotcha The PySpark session must be configured with the correct Hadoop AWS JARs for S3 access; missing JARs cause ingest jobs to fail with opaque stage errors.
fix Use SparkSession.builder.config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4').getOrCreate() or provide JARs via --jars.
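A minimal sketch of a session configured as the fix describes; the hadoop-aws version (3.3.4) is taken from the fix above and should be matched to your cluster's Hadoop distribution:

from pyspark.sql import SparkSession

# Pull in the S3A connector so ingest can write to S3.
# 3.3.4 is illustrative; match it to the Hadoop version on the cluster.
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
    .getOrCreate()
)

Equivalently, pass a local JAR at submit time: spark-submit --jars /path/to/hadoop-aws-3.3.4.jar.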
gotcha Ingesting DataFrames with columns containing null values in the record identifier column will fail with a Spark exception.
fix Ensure the record identifier column has no nulls; use df.na.drop(subset=['id']) before ingest.
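A short sketch of the null guard; the DataFrame contents and the 'id' column name are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row has a null record identifier and would make ingest fail.
df = spark.createDataFrame([(1, 'a'), (None, 'b')], ['id', 'value'])

# Drop rows whose record identifier is null before ingesting.
clean_df = df.na.drop(subset=['id'])  # keeps only (1, 'a')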
breaking In version 1.0.0, the module was restructured: `from sagemaker_feature_store_pyspark import FeatureStore` changed to `from sagemaker.feature_store.feature_store import FeatureStoreManager`.
fix Update imports to use the new path. Old code will raise ImportError.
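Before and after, per the restructuring described above:

# Pre-1.0.0 (raises ImportError on 1.0.0 and later):
# from sagemaker_feature_store_pyspark import FeatureStore

# 1.0.0 and later:
from sagemaker.feature_store.feature_store import FeatureStoreManager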

Creates a Spark DataFrame and ingests it into a SageMaker Feature Group using PySpark bindings.

from pyspark.sql import SparkSession
from sagemaker.feature_store.feature_store import FeatureStoreManager

# Assumes the hadoop-aws JARs are already on the classpath (see gotchas above).
spark = SparkSession.builder.getOrCreate()
fs = FeatureStoreManager()

# The record identifier column ('id') must contain no nulls.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

record_id = 'id'
feature_group_name = 'my-feature-group'
fs.ingest(df, feature_group_name, record_identifier_name=record_id)