Python Client for Apache Livy
pylivy is a Python client for Apache Livy, an open-source REST interface for interacting with Apache Spark. It enables remote code execution on a Spark cluster through both interactive and batch sessions. The current version is 0.8.0, released in January 2021; releases appear to be made as needed rather than on a fixed schedule.
Warnings
- gotcha The Python package name on PyPI is `livy`, but the GitHub repository and project are often referred to as `pylivy`. Ensure you use `pip install livy` for installation and `from livy import ...` for imports.
- gotcha The `LivySession.create()` method is the recommended way to initialize a session, rather than directly instantiating `LivySession()`. While older documentation or examples might show direct instantiation, `create()` handles session setup and waiting for readiness more robustly.
- gotcha `session.download()` collects the entire DataFrame on the cluster and transfers it to the client. For very large datasets this can be slow or exhaust client memory. Process large datasets on Spark and write the results to shared storage (e.g., S3, HDFS) instead of downloading them.
- gotcha Python 3.6 or later is required. Earlier Python versions are not supported.
- gotcha For production environments, always secure your Apache Livy server with HTTPS and configure proper authentication. The `pylivy` client accepts `requests`-compatible auth objects (e.g., `HTTPBasicAuth`) or a custom `requests.Session` object for secure communication.
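The large-DataFrame warning above can be sketched as two small helpers: one caps the row count on the Spark side before downloading, the other writes the full result to shared storage. This is a minimal sketch assuming a PySpark session. The helper names, the `_preview` suffix, and the `s3a://example-bucket/...` path are hypothetical; only `session.run()` and `session.download()` come from the text above. A stand-in session records the code that would be submitted, so the sketch runs without a cluster.

```python
def download_preview(session, df_name, max_rows=1000):
    """Download at most max_rows rows of a remote DataFrame."""
    preview_name = f"{df_name}_preview"  # hypothetical naming convention
    # Truncate on the cluster so only max_rows rows cross the network.
    session.run(f"{preview_name} = {df_name}.limit({int(max_rows)})")
    return session.download(preview_name)


def persist_to_storage(session, df_name, path):
    """Write the full DataFrame to shared storage instead of downloading it."""
    # repr() quotes the path for the remote Python interpreter.
    session.run(f"{df_name}.write.mode('overwrite').parquet({path!r})")


class RecordingSession:
    """Stand-in for a LivySession (illustration only): records submitted code."""

    def __init__(self):
        self.submitted = []

    def run(self, code):
        self.submitted.append(code)

    def download(self, name):
        return f"<rows of {name}>"


demo = RecordingSession()
preview = download_preview(demo, "filtered_df", max_rows=100)
persist_to_storage(demo, "filtered_df", "s3a://example-bucket/out/filtered")
for line in demo.submitted:
    print(line)
```

With a real session, pass the object returned by `LivySession.create()` in place of `RecordingSession()`; the helpers rely only on its `run` and `download` methods.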
Install
pip install livy
Imports
- LivySession
from livy import LivySession
- LivyBatch
from livy import LivyBatch
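Batch usage might look like the following minimal sketch. It assumes `LivyBatch.create()` accepts `file` and `class_name` keyword arguments and that `wait()` blocks until the job reaches a terminal state, as described in the pylivy documentation. The `run_batch` helper and its `batch_cls` parameter are hypothetical, added so the logic can be exercised without a Livy server.

```python
def run_batch(url, file, class_name=None, batch_cls=None):
    """Submit a batch job, block until it finishes, and return its final state."""
    if batch_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in class.
        from livy import LivyBatch as batch_cls
    # Assumed signature: LivyBatch.create(url, file=..., class_name=...)
    batch = batch_cls.create(url, file=file, class_name=class_name)
    # Assumed behaviour: wait() polls until the batch reaches a terminal state.
    return batch.wait()
```

A call such as `run_batch("http://localhost:8998", file="local:///path/to/app.jar", class_name="com.example.Main")` would submit the job to a local Livy server; the jar path and class name here are placeholders.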
Quickstart
import os

from livy import LivySession
from requests.auth import HTTPBasicAuth

# Configure the Livy server URL and optional authentication.
LIVY_URL = os.environ.get('LIVY_SERVER_URL', 'http://localhost:8998')
# No default credentials: basic auth is used only when LIVY_USERNAME is set.
LIVY_USERNAME = os.environ.get('LIVY_USERNAME')
LIVY_PASSWORD = os.environ.get('LIVY_PASSWORD', '')
auth = HTTPBasicAuth(LIVY_USERNAME, LIVY_PASSWORD) if LIVY_USERNAME else None

try:
    # The context manager waits for the session to be ready and closes it on exit.
    with LivySession.create(LIVY_URL, auth=auth) as session:
        print(f"Livy session {session.session_id} created successfully.")

        # Run some Spark code on the remote cluster.
        session.run("df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])")
        session.run("filtered_df = df.filter(df.name == 'Bob')")

        # Retrieve the result as a pandas DataFrame.
        local_df = session.download("filtered_df")
        print("Downloaded DataFrame:")
        print(local_df)
except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure a Livy server is running and accessible at the specified URL.")