GraphFrames: DataFrame-based Graphs
GraphFrames is a powerful library for graph processing built on Apache Spark DataFrames. It allows users to perform graph analytics, queries, and algorithms like PageRank, connected components, and shortest paths directly using Spark's DataFrame API. The project is actively maintained, with frequent releases. The official Python package on PyPI is `graphframes-py`, and its latest version is 0.11.0.
Warnings
- breaking: The official PyPI package name for GraphFrames changed from `graphframes` to `graphframes-py` starting with v0.9.0. The old `graphframes` package on PyPI is severely outdated (v0.6) and should not be used.
- breaking: The Maven groupId for the GraphFrames JAR changed from `graphframes` to `io.graphframes` in v0.9.0. This affects how you specify the package with `--packages` in `spark-submit` or in `SparkSession` configurations.
- gotcha: GraphFrames is a Spark library and requires a Spark runtime (a local-mode session is enough). Installing the Python package (`graphframes-py`) is not sufficient on its own; you must also provide the GraphFrames JAR to your Spark runtime. This is typically done via `spark-submit --packages` or `SparkSession.builder.config("spark.jars.packages", ...)`.
- gotcha: GraphFrames has strict compatibility requirements with specific versions of Spark and Scala. Using a GraphFrames JAR built for a different Spark or Scala version than your distribution will lead to runtime errors.
- breaking: Significant API updates occurred in v0.9.0, including changes to the Pregel API and to the internal implementations of Connected Components (CC), Community Detection using Label Propagation (CDLP), and Shortest Paths (SP). Some GraphX-free implementations were introduced.
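The groupId change is easy to miss in an existing `spark.jars.packages` value. As a minimal sketch, the hypothetical helper below (it is not part of GraphFrames) splits such a value and separates coordinates that use the current `io.graphframes` groupId from those that still use the legacy `graphframes` groupId. It only inspects the string; it does not verify that the artifact exists or that it matches your Spark and Scala versions.

```python
LEGACY_GROUP = "graphframes"      # groupId used before v0.9.0
CURRENT_GROUP = "io.graphframes"  # groupId from v0.9.0 onward


def classify_graphframes_packages(packages: str) -> dict:
    """Sort the GraphFrames coordinates found in a spark.jars.packages value.

    `packages` is a comma-separated list of Maven coordinates in
    group:artifact:version form. Hypothetical helper, for illustration only.
    """
    found = {"current": [], "legacy": []}
    for coord in (c.strip() for c in packages.split(",")):
        if not coord:
            continue  # skip empty entries such as a trailing comma
        group = coord.split(":", 1)[0]
        if group == CURRENT_GROUP:
            found["current"].append(coord)
        elif group == LEGACY_GROUP:
            found["legacy"].append(coord)
        # Coordinates from any other group are unrelated and ignored.
    return found
```

A non-empty `legacy` list in the result suggests that the configuration predates v0.9.0 and should be updated before upgrading.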
Install
- pip install graphframes-py
- spark-submit --packages io.graphframes:graphframes-spark3_2.12:0.11.0 your_app.py
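Because the outdated `graphframes` distribution and the current `graphframes-py` distribution can both be present in an environment, it may help to check which one is installed. The sketch below uses only the standard library; the function names are illustrative and not part of GraphFrames.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def graphframes_install_status() -> dict:
    """Report the installed versions of the current and the legacy PyPI packages."""
    return {name: installed_version(name) for name in ("graphframes-py", "graphframes")}


if __name__ == "__main__":
    # A version next to "graphframes" (the legacy name) is a hint that the
    # outdated package is installed and may shadow the current one.
    for name, ver in graphframes_install_status().items():
        print(f"{name}: {ver or 'not installed'}")
```

This only covers the Python side; the JAR still has to be supplied to Spark separately.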
Imports
- GraphFrame
from graphframes import GraphFrame
Quickstart
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from graphframes import GraphFrame
# Create a SparkSession that pulls in the GraphFrames JAR
# IMPORTANT: The coordinate below targets Spark 3.x with Scala 2.12. Replace it
# with the artifact and version compatible with your Spark and Scala
# installation. Check the GraphFrames docs for details.
spark = SparkSession.builder \
.appName("GraphFrames Quickstart") \
.config("spark.jars.packages", "io.graphframes:graphframes-spark3_2.12:0.11.0") \
.getOrCreate()
# Create a Vertex DataFrame
v = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36)
], ["id", "name", "age"])
# Create an Edge DataFrame
e = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend")
], ["src", "dst", "relationship"])
# Create a GraphFrame
g = GraphFrame(v, e)
# Run PageRank algorithm
results = g.pageRank(resetProbability=0.15, maxIter=5)
results.vertices.select("id", "pagerank").show()
spark.stop()
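Beyond PageRank, the same graph supports the other operations mentioned in the introduction. The sketch below assumes a GraphFrame such as `g` from the Quickstart and a session that is still active, so it has to be called before `spark.stop()`. The function name `explore_graph` and the checkpoint path are illustrative choices, and the landmark IDs are taken from the sample vertices.

```python
def explore_graph(spark, g, checkpoint_dir="/tmp/graphframes-checkpoints"):
    """Run a few GraphFrames queries on `g` and return the resulting DataFrames."""
    # Connected components may checkpoint intermediate results, so a
    # checkpoint directory is set first (the path is an arbitrary example).
    spark.sparkContext.setCheckpointDir(checkpoint_dir)

    return {
        # Number of incoming edges per vertex (columns: id, inDegree).
        "in_degrees": g.inDegrees,
        # Motif query: pairs of vertices connected by edges in both directions.
        "mutual": g.find("(x)-[e1]->(y); (y)-[e2]->(x)"),
        # Vertices with an added "component" column.
        "components": g.connectedComponents(),
        # Vertices with an added "distances" map: landmark id -> hop count.
        "paths": g.shortestPaths(landmarks=["a", "d"]),
    }
```

With the Quickstart graph, a call such as `explore_graph(spark, g)["components"].select("id", "component").show()` would print the component assigned to each vertex. Returning DataFrames instead of calling `show()` inside the function keeps the queries lazy until the caller decides what to display.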