GraphFrames: DataFrame-based Graphs

0.11.0 · active · verified Fri Apr 10

GraphFrames is a graph processing library built on Apache Spark DataFrames. It supports graph queries and algorithms such as PageRank, connected components, and shortest paths through Spark's DataFrame API. The project is actively maintained. The official Python package on PyPI is `graphframes-py`; its latest version is 0.11.0.

Warnings

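Install

Install the Python package from PyPI with `pip install graphframes-py`. The quickstart loads the JVM side of GraphFrames separately, through Spark's `spark.jars.packages` setting.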

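Imports

The main entry point is the `GraphFrame` class, imported with `from graphframes import GraphFrame`.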

Quickstart

This quickstart initializes a SparkSession with the GraphFrames package, creates vertex and edge DataFrames, constructs a GraphFrame, and runs PageRank. Make sure the package coordinate passed to `spark.jars.packages` (or to `--packages` on the command line) matches the GraphFrames build for your Spark and Scala versions.

from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Create a SparkSession with GraphFrames package
# IMPORTANT: Replace '0.11.0-spark3.5-s_2.12' with the version compatible
# with your Spark and Scala installation. Check GraphFrames docs for details.
spark = SparkSession.builder \
    .appName("GraphFrames Quickstart") \
    .config("spark.jars.packages", "org.graphframes:graphframes:0.11.0-spark3.5-s_2.12") \
    .getOrCreate()

# Create a vertex DataFrame; GraphFrames requires a unique "id" column
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36)
], ["id", "name", "age"])

# Create an edge DataFrame; GraphFrames requires "src" and "dst" columns holding vertex ids
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend")
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

# Run PageRank for a fixed number of iterations
results = g.pageRank(resetProbability=0.15, maxIter=5)
results.vertices.select("id", "pagerank").show()

spark.stop()
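
To make the meaning of the `pagerank` column more concrete, here is a minimal pure-Python sketch of PageRank by power iteration over the same edge list. It is an illustration only, not the GraphFrames implementation: the function name `pagerank_sketch` and its parameters are hypothetical, the ranks are normalized to sum to 1, and the absolute values GraphFrames reports may be scaled differently. The relative ordering of the vertices is the useful part.

```python
def pagerank_sketch(edges, reset_probability=0.15, max_iter=5):
    """Power-iteration PageRank over (src, dst) pairs; ranks sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Every vertex starts each round with the "teleport" share.
        new_ranks = {node: reset_probability / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = out_links[node]
            if targets:
                # Split the remaining rank evenly across outgoing edges.
                share = (1.0 - reset_probability) * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A vertex with no outgoing edges spreads its rank to everyone.
                dangling += (1.0 - reset_probability) * ranks[node]
        for node in nodes:
            new_ranks[node] += dangling / n
        ranks = new_ranks
    return ranks


# Same edges as the quickstart, without the "relationship" attribute.
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"),
]
ranks = pagerank_sketch(edges)
for node, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(rank, 4))
```

In this graph, "b" and "c" point at each other and collect rank from the rest of the graph, so they end up with the highest scores, while "e" has no incoming edges and keeps only the teleport share.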
