ceja

0.4.0 · maintenance · verified Thu Apr 16

ceja is a Python library that provides PySpark implementations of string and phonetic matching algorithms. It lets users apply functions such as NYSIIS, Metaphone, Jaro-Winkler similarity, and Damerau-Levenshtein distance directly to PySpark DataFrame columns, so the matching runs through Spark's distributed processing on large datasets. The current version is 0.4.0, last released in February 2023, which suggests a slow release cadence.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart creates a SparkSession, builds a small DataFrame, and applies a ceja string matching function (damerau_levenshtein_distance) to two of its columns. If you are not running on a Spark cluster or in another environment where pyspark is already importable, install findspark and call findspark.init() before importing pyspark.

import findspark
findspark.init() # Initialize findspark if not running in a native Spark environment
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import ceja

# Create a SparkSession
spark = SparkSession.builder.appName("CejaQuickstart").getOrCreate()

# Sample data
data = [ ("jellyfish", "smellyfish"), ("li", "lee"), ("luisa", "bruna"), (None, None) ]
df = spark.createDataFrame(data, ["word1", "word2"])

# Apply a ceja function (e.g., damerau_levenshtein_distance)
result_df = df.withColumn(
    "damerau_levenshtein_distance",
    ceja.damerau_levenshtein_distance(sf.col("word1"), sf.col("word2"))
)

result_df.show()

# Stop SparkSession
spark.stop()
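
To sanity-check the values in the damerau_levenshtein_distance column without starting Spark, a small pure-Python reference can help. The sketch below defines osa_distance, a hypothetical helper that is not part of ceja. It implements the optimal string alignment variant of Damerau-Levenshtein distance; ceja's own implementation may use a different variant and can disagree on some inputs, so read this as an illustration of the metric, not of the library's internals. Returning None for missing inputs is an assumption made here for the null row in the sample data, not confirmed ceja behavior.

```python
from typing import Optional


def osa_distance(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Optimal string alignment distance (hypothetical helper, not ceja's API)."""
    # Assumption: propagate missing values instead of raising.
    if a is None or b is None:
        return None

    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = number of edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # An adjacent transposition such as "ab" -> "ba" counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)

    return dist[-1][-1]


# Same sample pairs as the quickstart DataFrame.
pairs = [("jellyfish", "smellyfish"), ("li", "lee"), ("luisa", "bruna"), (None, None)]
for w1, w2 in pairs:
    print(w1, w2, osa_distance(w1, w2))
```

Under this variant, the sample pairs score 2, 2, and 4, and the null pair yields None. A single adjacent swap such as "ab" versus "ba" costs 1, which is what separates the Damerau family from plain Levenshtein distance.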
