ceja
ceja is a Python library that provides PySpark implementations of string and phonetic matching algorithms. It lets users apply functions such as NYSIIS, Metaphone, Jaro-Winkler similarity, and Damerau-Levenshtein distance directly to PySpark DataFrame columns, so the matching runs on Spark's distributed engine and scales to large datasets. The current version is 0.4.0, released in February 2023; releases are infrequent.
Common errors
- ModuleNotFoundError: No module named 'pyspark'
  cause: PySpark is not installed or not correctly configured in your Python environment. ceja relies entirely on PySpark for its functionality.
  fix: Install PySpark using `pip install pyspark` and ensure your environment variables (like `SPARK_HOME`) are correctly set if running locally.
- AttributeError: 'str' object has no attribute '_jc_op_eq'
  cause: You are passing a literal Python string to a ceja function that expects a Spark Column object.
  fix: Wrap string literals with `sf.lit()`, or pass `sf.col("your_column")` to ceja functions (`sf` is `pyspark.sql.functions`, as imported in the Quickstart).
- TypeError: Column is not iterable
  cause: A ceja function received an argument of the wrong type, most likely a Spark Column object where a plain Python value was expected, or vice versa.
  fix: Verify the expected input types for the specific ceja function you are using, and make sure you are applying it to the correct Spark DataFrame columns.
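The first error above can be caught before any Spark code runs. The following is a minimal sketch, not part of ceja, that uses only the standard library to report which modules cannot be found on the import path. The helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found on the import path.

    Hypothetical helper for illustration; it is not part of ceja or PySpark.
    Only top-level module names are expected here.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two modules the quickstart depends on before importing them.
missing = missing_modules(["pyspark", "ceja"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("pyspark and ceja are importable")
```

This only confirms that the modules can be located. It does not validate `SPARK_HOME` or any other Spark configuration.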
Warnings
- gotcha ceja functions are designed for PySpark DataFrames. Attempting to use them directly on native Python strings or non-Spark data structures will result in runtime errors like 'AttributeError' or 'TypeError'.
- gotcha The library's last update was in February 2023. While generally stable, this slow release cadence might lead to compatibility issues with very recent versions of Python or PySpark, or a slower response to new feature requests/bug fixes.
- gotcha The project lacks extensive official documentation beyond the GitHub README and has an empty project description on PyPI, which can make advanced usage or troubleshooting more challenging.
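Because compatibility with newer Python or PySpark releases is not guaranteed, it can help to record the exact versions in use when reporting or debugging a problem. A minimal sketch using the standard library follows; the helper name `installed_version` is hypothetical, and the distribution names `ceja` and `pyspark` are assumed to match the names used with `pip install`.

```python
import platform
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it is not part of ceja.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Print the interpreter version and the versions of the two relevant packages.
print("python", platform.python_version())
for name in ("ceja", "pyspark"):
    print(name, installed_version(name) or "not installed")
```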
Install
- pip install ceja
Imports
- ceja
import ceja
Quickstart
# Optional: findspark is a separate package (pip install findspark) and is only
# needed when Spark is not already discoverable from your Python environment.
import findspark
findspark.init()
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import ceja
# Create a SparkSession
spark = SparkSession.builder.appName("CejaQuickstart").getOrCreate()
# Sample data, including a row of nulls
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None),
]
df = spark.createDataFrame(data, ["word1", "word2"])
# Apply a ceja function (e.g., damerau_levenshtein_distance) to two columns
result_df = df.withColumn(
    "damerau_levenshtein_distance",
    ceja.damerau_levenshtein_distance(sf.col("word1"), sf.col("word2")),
)
result_df.show()
# Stop the SparkSession
spark.stop()
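To make the quickstart column easier to interpret, here is a plain-Python sketch of an edit distance that counts insertions, deletions, substitutions, and swaps of adjacent characters. It is not ceja's code: it implements the optimal string alignment variant of Damerau-Levenshtein, which may differ from ceja's result on some inputs, and returning `None` for missing values is an assumption of this sketch. The function name `osa_distance` is hypothetical.

```python
from typing import Optional


def osa_distance(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Optimal string alignment distance between two strings.

    Hypothetical reference helper, not ceja's implementation.
    Returns None if either input is None (an assumption of this sketch).
    """
    if a is None or b is None:
        return None
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Same sample pairs as the quickstart data.
for w1, w2 in [("jellyfish", "smellyfish"), ("li", "lee"), ("luisa", "bruna"), (None, None)]:
    print(w1, w2, osa_distance(w1, w2))
```

A local reference like this is useful for spot-checking a handful of rows collected from the DataFrame. For large datasets the Spark column expression remains the right tool.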