AWS Glue Development Library (awsglue-dev)

2021.12.30 · active · verified Wed Apr 15

awsglue-dev provides Python interfaces to the AWS Glue ETL library, primarily for local development, IDE auto-completion, and local script validation. It extends Apache Spark with additional data types and operations for ETL workflows. The package version `2021.12.30` is part of an ecosystem that facilitates authoring scripts for AWS Glue, a fully managed, serverless ETL service.

Warnings

Install
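
Assuming the package is distributed under the name shown in the header (`awsglue-dev`), a version-pinned install would look like the following. This is a sketch; adjust the package source if your organization hosts the library on a private index.

```shell
# Pin to the version this document describes.
# Assumes the package is installable under the name `awsglue-dev`.
pip install awsglue-dev==2021.12.30
```

Note that this installs only the local development interfaces; the actual Glue ETL runtime is provided by the AWS Glue service or a Glue Docker image.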

Imports
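
The Quickstart below uses the following modules; they are collected here for reference. The `awsglue` imports mirror the public entry points used in this document (the `DynamicFrame` import appears in the commented example):

```python
import sys

# PySpark entry point (awsglue builds on Apache Spark)
from pyspark.context import SparkContext

# Glue-specific wrappers used in the Quickstart
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
```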

Quickstart

This quickstart demonstrates the foundational boilerplate for an AWS Glue ETL script, initializing the SparkContext, GlueContext, and Job objects. While `awsglue-dev` provides the interfaces locally, actual ETL execution often requires a Glue environment (e.g., Docker container or AWS Glue service) to run successfully with real data sources. The `getResolvedOptions` function is used to handle job parameters, which are central to Glue job execution.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# These parameters are typically passed by AWS Glue service
# For local development, you might set dummy values or omit if not testing getResolvedOptions
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Your Glue ETL script logic would go here
# For example, to create a DynamicFrame:
# from awsglue.dynamicframe import DynamicFrame
# dynamic_frame = glueContext.create_dynamic_frame.from_options(
#     connection_type='s3', 
#     connection_options={'paths': ['s3://your-bucket/your-data/'], 'recurse': True}, 
#     format='json'
# )

print(f"Initialized GlueContext and SparkSession for job: {args['JOB_NAME']}")
# Don't forget job.commit() in a real Glue job
# job.commit()
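
For a quick sanity check in an environment where `awsglue` is not installed, the argument handling of `getResolvedOptions` can be approximated with a small stand-in. The helper below is hypothetical: it only mimics the `--NAME value` style in which the Glue service passes job parameters, and is not the library's actual implementation.

```python
def get_resolved_options_stub(argv, options):
    """Minimal, illustrative stand-in for awsglue.utils.getResolvedOptions.

    Looks up each required option as a '--NAME value' pair in argv and
    returns a dict mapping option names to their string values.
    """
    resolved = {}
    for name in options:
        flag = '--' + name
        if flag not in argv:
            raise KeyError(f"missing required argument: {flag}")
        resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# Example: the Glue service invokes a script with arguments roughly like
#   script.py --JOB_NAME my-job
args = get_resolved_options_stub(['script.py', '--JOB_NAME', 'my-job'], ['JOB_NAME'])
print(args['JOB_NAME'])
```

This lets you exercise the parameter-handling portion of a script locally before moving it into a real Glue environment.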
