AWS Glue Local Development
The `awsglue3-local` package is a Python utility that facilitates local development of AWS Glue 3.0 jobs. It simplifies setting up a local PySpark environment that mimics the Glue 3.0 runtime, so developers can test Glue scripts outside the AWS cloud. The latest release is version 1.0.0; releases are irregular and typically driven by Glue version compatibility.
Common errors
- `ModuleNotFoundError: No module named 'awsglue.context'`
  - Cause: The Python environment does not have the `awsglue` module on its `sys.path`, or `awsglue3-local` failed to configure the environment properly.
  - Fix: Ensure `awsglue3-local` is correctly installed. If running with PySpark, confirm that the Glue libraries (often `aws-glue-libs.jar`) are on your Spark classpath. `awsglue3-local` should ideally handle this, but manual intervention may be needed for complex setups.
- `java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem`
  - Cause: The Hadoop AWS S3 connector JARs, which are required for interacting with S3 buckets from Spark/Glue, are missing from your Spark classpath.
  - Fix: Ensure your local Spark environment (or the environment configured by `awsglue3-local`) includes the correct Hadoop-AWS JARs. With `pyspark` directly, this often means `spark-submit --packages org.apache.hadoop:hadoop-aws:x.y.z ...`.
- `Py4JJavaError: An error occurred while calling o72.getDynamicFrame.fromDF.`
  - Cause: `DynamicFrame` operations were attempted without a fully initialized Glue context, there are underlying Spark/JVM issues with the Glue extensions, or the `awsglue` libraries are not properly linked.
  - Fix: Verify that `GlueContext` is initialized (`GlueContext(sc)`) and that `job.init()` has been called. Ensure your local environment can load the Glue-specific Spark extensions. Restarting the Spark session sometimes helps.
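For the `S3AFileSystem` error above, one common approach is to pull the connector in at launch time. A sketch, with illustrative version numbers and script name (match `hadoop-aws` to your local Hadoop build; Glue 3.0 runs Spark 3.1.x, which pairs with Hadoop 3.2.x):

```shell
# Illustrative: fetch the S3A connector when launching a local Glue-style script.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.2.0 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
  my_glue_script.py --JOB_NAME local-test
```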
Warnings
- gotcha The `awsglue` module is typically part of the AWS Glue runtime and is not fully pip-installable as a complete, standalone library providing all native Glue functionalities. `awsglue3-local` aims to provide the necessary environment and stub modules to allow standard Glue job scripts to run locally, but some features (e.g., direct S3/JDBC connectors without specific Hadoop/Spark configurations) might still require additional setup or behave differently.
- breaking Local Glue development environments, including those set up with `awsglue3-local`, can exhibit behavioral differences compared to the actual AWS Glue cloud environment. These discrepancies can stem from differences in Spark configuration, underlying libraries, resource management, or specific Glue service integrations not fully replicated locally.
- gotcha When using `getResolvedOptions`, if job arguments are not provided (e.g., when running a script directly without emulating `spark-submit --conf 'spark.driver.args="--JOB_NAME myjob"'`), it will raise an error indicating required arguments are missing. This is a common pitfall in local development.
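The `getResolvedOptions` pitfall can be sidestepped by appending the expected `--KEY value` pairs to `sys.argv` before parsing. The helper below is a hypothetical stand-in that mimics how `getResolvedOptions` reads arguments, so the sketch runs even without `awsglue` installed:

```python
import sys
import argparse

def parse_job_args(argv, required):
    # Hypothetical stand-in for awsglue.utils.getResolvedOptions:
    # Glue supplies job arguments as --KEY value pairs on the command line.
    parser = argparse.ArgumentParser()
    for name in required:
        parser.add_argument("--" + name, required=True)
    known, _ = parser.parse_known_args(argv[1:])
    return vars(known)

# Locally, inject the arguments Glue would normally pass:
sys.argv += ["--JOB_NAME", "local-test"]
args = parse_job_args(sys.argv, ["JOB_NAME"])
print(args["JOB_NAME"])  # local-test
```

With the real `getResolvedOptions`, the same `sys.argv` injection works; the missing-arguments error only appears when no `--JOB_NAME` pair is present.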
Install
- pip install awsglue3-local
Imports
- GlueContext
  - `from glue.context import GlueContext`
  - `from awsglue.context import GlueContext`
- SparkSession
  - `from pyspark.sql import SparkSession`
- getResolvedOptions
  - `from glue.utils import getResolvedOptions`
  - `from awsglue.utils import getResolvedOptions`
- DynamicFrame
  - `from awsglue.dynamicframe import DynamicFrame`
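Because `awsglue` is not always importable outside the Glue runtime (see Warnings above), scripts sometimes guard the Glue imports so they can still be loaded in a bare Python environment. A minimal sketch of that defensive pattern:

```python
# Try the Glue-provided module first; fall back to a sentinel so the
# script can still be imported in environments without the Glue libs.
try:
    from awsglue.utils import getResolvedOptions
    HAVE_GLUE = True
except ImportError:
    getResolvedOptions = None
    HAVE_GLUE = False

print("awsglue available:", HAVE_GLUE)
```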
Quickstart
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Glue passes job arguments on the command line as --KEY value pairs.
# When running locally, supply them yourself, e.g.:
#   python this_script.py --JOB_NAME local-quickstart
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Example: create a simple Spark DataFrame
data = [("Alice", 1), ("Bob", 2)]
df = spark.createDataFrame(data, ["Name", "Id"])
df.show()
# Example: use a Glue DynamicFrame (reading real data sources needs more setup)
try:
    from awsglue.dynamicframe import DynamicFrame
    # Reading would typically involve S3, JDBC, etc.; for a purely local test,
    # convert a Spark DataFrame to a DynamicFrame instead.
    dynamic_frame = DynamicFrame.fromDF(df, glueContext, "example_df")
    dynamic_frame.printSchema()
except ImportError:
    print("awsglue.dynamicframe not available in this minimal local setup.")
job.commit()
print("Glue job finished locally.")