{"id":14485,"library":"ceja","title":"ceja","description":"ceja is a Python library that provides PySpark implementations of string and phonetic matching algorithms. It enables users to apply functions like NYSIIS, Metaphone, Jaro-Winkler similarity, and Damerau-Levenshtein distance directly within PySpark DataFrames, leveraging Spark's distributed processing capabilities for large datasets. The library is currently at version 0.4.0, with its last release in February 2023, indicating a slow release cadence.","status":"maintenance","version":"0.4.0","language":"en","source_language":"en","source_url":"https://github.com/mrpowers/ceja","tags":["pyspark","string matching","phonetic algorithms","stemming","data processing","fuzzy matching"],"install":[{"cmd":"pip install ceja","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"ceja functions are designed to operate on PySpark DataFrames and require a running Spark environment.","package":"pyspark","optional":false}],"imports":[{"note":"All functions are exposed directly under the 'ceja' module namespace after a simple import.","symbol":"ceja","correct":"import ceja"}],"quickstart":{"code":"import findspark\nfindspark.init() # Initialize findspark if not running in a native Spark environment\nfrom pyspark.sql import SparkSession\nimport pyspark.sql.functions as sf\nimport ceja\n\n# Create a SparkSession\nspark = SparkSession.builder.appName(\"CejaQuickstart\").getOrCreate()\n\n# Sample data\ndata = [ (\"jellyfish\", \"smellyfish\"), (\"li\", \"lee\"), (\"luisa\", \"bruna\"), (None, None) ]\ndf = spark.createDataFrame(data, [\"word1\", \"word2\"])\n\n# Apply a ceja function (e.g., damerau_levenshtein_distance)\nresult_df = df.withColumn(\n    \"damerau_levenshtein_distance\",\n    ceja.damerau_levenshtein_distance(sf.col(\"word1\"), sf.col(\"word2\"))\n)\n\nresult_df.show()\n\n# Stop SparkSession\nspark.stop()\n","lang":"python","description":"This quickstart demonstrates how to initialize a 
SparkSession, create a DataFrame, and apply a ceja string matching function (damerau_levenshtein_distance) to columns within the DataFrame. Ensure findspark is installed and initialized if you're not running directly on a Spark cluster."},"warnings":[{"fix":"Always pass Spark DataFrame columns (e.g., `sf.col(\"column_name\")`) to ceja functions after importing `pyspark.sql.functions as sf`.","message":"ceja functions are designed for PySpark DataFrames. Attempting to use them directly on native Python strings or non-Spark data structures will result in runtime errors like 'AttributeError' or 'TypeError'.","severity":"gotcha","affected_versions":"All"},{"fix":"Test thoroughly with your specific PySpark and Python environment. Consider pinning versions if stability is critical.","message":"The library's last update was in February 2023. While generally stable, this slow release cadence might lead to compatibility issues with very recent versions of Python or PySpark, or a slower response to new feature requests/bug fixes.","severity":"gotcha","affected_versions":"All current versions (0.4.0)"},{"fix":"Refer to the GitHub repository's README for available functions and basic usage patterns. Inspect the source code if deeper understanding is required.","message":"The project lacks extensive official documentation beyond the GitHub README and has an empty project description on PyPI, which can make advanced usage or troubleshooting more challenging.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install PySpark using `pip install pyspark` and ensure your environment variables (like `SPARK_HOME`) are correctly set if running locally.","cause":"The PySpark library is not installed or not correctly configured in your Python environment. 
ceja relies entirely on PySpark for its functionality.","error":"ModuleNotFoundError: No module named 'pyspark'"},{"fix":"Wrap string literals with `sf.lit()`, or pass `sf.col(\"your_column\")` to ceja functions.","cause":"A literal Python string was passed to a ceja function that expects a Spark Column object.","error":"AttributeError: 'str' object has no attribute '_jc_op_eq'"},{"fix":"Verify the expected input types for the specific ceja function you are using, and make sure Spark Columns are passed only where a Column is expected.","cause":"PySpark raises this error whenever Python code tries to iterate over a Column. A Spark Column was likely passed where a Python iterable or plain Python value was expected.","error":"TypeError: Column is not iterable"}],"ecosystem":"pypi"}