AWS Glue Development Library (awsglue-dev)
awsglue-dev provides Python interfaces to the AWS Glue ETL library, primarily for local development, IDE auto-completion, and local script validation. It extends Apache Spark with data types and operations tailored to ETL workflows. Version `2021.12.30` of the package targets AWS Glue, AWS's fully managed, serverless ETL service.
Warnings
- breaking The `awsglue-dev` package primarily offers Python interfaces for local development (e.g., IDE auto-completion, static analysis). Actual AWS Glue ETL scripts built with these interfaces *must be executed within the AWS Glue service* or a compatible local Docker environment that includes the Glue Spark runtime JARs.
- gotcha For local development with `awsglue-dev`, the `pyspark` library is a mandatory peer dependency and must be installed separately. Without it, core components like `SparkContext` and `GlueContext` will not function.
- gotcha AWS Glue job scripts do not partition output data by default when writing to a target data sink. Writing large datasets as unpartitioned output can make downstream reads slow and expensive; specify partition keys explicitly when writing.
- gotcha AWS Glue environments (which `awsglue-dev` mirrors) often ship with a set of pre-installed Python packages, some of which may contain known vulnerabilities or be outdated. Relying solely on these default versions can pose security risks.
- breaking Migrating AWS Glue jobs between major Glue versions (e.g., from Glue 2.0/3.0 to 4.0/5.0) can introduce breaking changes due to underlying Spark version upgrades, changes in supported Python versions, or deprecation of certain libraries/APIs. Scripts developed with `awsglue-dev` might need adjustments.
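To address the partitioning gotcha above, a sketch of an explicitly partitioned write. This only executes inside the AWS Glue runtime (or a Glue Docker image), and the bucket path, frame variable, and partition columns are assumptions for illustration:

```python
# Sketch: partition output by year/month when writing a DynamicFrame to S3.
# Assumes `glueContext` and `dynamic_frame` already exist (see Quickstart);
# the path and partition columns below are placeholders.
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/output/",
        "partitionKeys": ["year", "month"],  # columns must exist in the frame
    },
    format="parquet",
)
```

Without `partitionKeys`, Glue writes all records under the target path with no directory partitioning.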
Install
- pip install awsglue-dev
- pip install pyspark
Imports
- GlueContext
from awsglue.context import GlueContext
- SparkContext
from pyspark.context import SparkContext
- Job
from awsglue.job import Job
- getResolvedOptions
from awsglue.utils import getResolvedOptions
- DynamicFrame
from awsglue.dynamicframe import DynamicFrame
- * (transforms)
from awsglue.transforms import *
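`getResolvedOptions` expects Glue-style `--NAME value` argument pairs and raises if a required option is missing. When the Glue runtime is unavailable, a hypothetical local stand-in (the function below is not part of awsglue) can mimic that behavior for testing:

```python
def get_resolved_options_stub(argv, option_names):
    """Hypothetical local stand-in for awsglue.utils.getResolvedOptions.

    Glue passes job parameters as --NAME value pairs; this mimics that
    just enough for local testing without the Glue runtime.
    """
    args = {}
    tokens = argv[1:]
    for i, token in enumerate(tokens):
        if token.startswith("--") and token[2:] in option_names and i + 1 < len(tokens):
            args[token[2:]] = tokens[i + 1]
    missing = [name for name in option_names if name not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    return args

# Simulate the argv Glue would pass to the script
argv = ["script.py", "--JOB_NAME", "local-test-job"]
print(get_resolved_options_stub(argv, ["JOB_NAME"]))
```

In a real job, keep the `getResolvedOptions` import and let the Glue service supply the arguments.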
Quickstart
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
# These parameters are typically passed by AWS Glue service
# For local development, you might set dummy values or omit if not testing getResolvedOptions
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Your Glue ETL script logic would go here
# For example, to create a DynamicFrame:
# from awsglue.dynamicframe import DynamicFrame
# dynamic_frame = glueContext.create_dynamic_frame.from_options(
# connection_type='s3',
# connection_options={'paths': ['s3://your-bucket/your-data/'], 'recurse': True},
# format='json'
# )
print(f"Initialized GlueContext and SparkSession for job: {args['JOB_NAME']}")
# Don't forget job.commit() in a real Glue job
# job.commit()
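Continuing the Quickstart, a sketch of a minimal transform-and-write step ending with `job.commit()`. It runs only inside the Glue runtime, and the source/target paths, column names, and types are assumptions for illustration:

```python
# Sketch: read, remap columns, write, then commit the job bookmark state.
# Assumes `glueContext` and `job` from the Quickstart; paths and the
# (id, amount) schema below are placeholders.
from awsglue.transforms import ApplyMapping

dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/your-data/"], "recurse": True},
    format="json",
)

# (source_col, source_type, target_col, target_type)
mapped = ApplyMapping.apply(
    frame=dynamic_frame,
    mappings=[
        ("id", "string", "id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output/"},
    format="parquet",
)

job.commit()  # required for job bookmarks to advance
```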