{"id":4393,"library":"spark-nlp","title":"Spark NLP","description":"John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML, providing performant and accurate NLP annotations for machine learning pipelines that scale in a distributed environment. It is currently at version 6.4.0 and releases new versions frequently, often multiple times a month, with a strong focus on LLM integration, multimodal document processing, and pipeline robustness.","status":"active","version":"6.4.0","language":"en","source_language":"en","source_url":"https://github.com/JohnSnowLabs/spark-nlp","tags":["nlp","apache spark","machine learning","distributed computing","llm","text processing","ai"],"install":[{"cmd":"pip install spark-nlp pyspark","lang":"bash","label":"Install core Spark NLP and PySpark"},{"cmd":"pip install spark-nlp[tensorflow]","lang":"bash","label":"Install with TensorFlow dependencies (e.g., for some models)"},{"cmd":"pip install spark-nlp[databricks]","lang":"bash","label":"Install for Databricks environments"}],"dependencies":[{"reason":"Spark NLP is built on Apache Spark and requires PySpark to interact with Spark clusters.","package":"pyspark","optional":false},{"reason":"Apache Spark runs on the JVM; a compatible JDK (typically JDK 8 or 11) is required for Spark NLP to function.","package":"Java Development Kit (JDK)","optional":false},{"reason":"Spark NLP's underlying JARs are built for specific Scala versions, which must match the Scala version of your Apache Spark installation.","package":"Scala","optional":false}],"imports":[{"symbol":"sparknlp","correct":"import sparknlp"},{"symbol":"DocumentAssembler","correct":"from sparknlp.base import DocumentAssembler"},{"symbol":"Tokenizer","correct":"from sparknlp.annotator import Tokenizer"},{"symbol":"WordEmbeddingsModel","correct":"from sparknlp.annotator import WordEmbeddingsModel"},{"symbol":"Pipeline","correct":"from pyspark.ml import Pipeline"},{"note":"While SparkSession 
is often instantiated using sparknlp.start(), the class itself is from pyspark.sql. When configuring manually, import from pyspark.sql.","wrong":"from sparknlp.base import SparkSession","symbol":"SparkSession","correct":"from pyspark.sql import SparkSession"},{"symbol":"LightPipeline","correct":"from sparknlp.base import LightPipeline"}],"quickstart":{"code":"import sparknlp\nfrom sparknlp.base import DocumentAssembler, LightPipeline\nfrom sparknlp.annotator import Tokenizer, WordEmbeddingsModel\nfrom pyspark.ml import Pipeline\n\n# 1. Initialize SparkSession with Spark NLP\n# Automatically handles Spark NLP JAR dependencies and configures Spark\n# Adjust the memory setting as needed for your environment (pass gpu=True for GPU builds)\nspark = sparknlp.start(memory=\"16g\")\n\n# 2. Define a simple Spark NLP pipeline\ndocument_assembler = DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\")\ntokenizer = Tokenizer().setInputCols([\"document\"]).setOutputCol(\"token\")\n\n# Load a pre-trained Word Embeddings model\n# (glove_100d is a small, general-purpose model suitable for quickstarts)\nword_embeddings = WordEmbeddingsModel.pretrained(\"glove_100d\", \"en\")\\\n    .setInputCols([\"document\", \"token\"])\\\n    .setOutputCol(\"embeddings\")\n\nnlp_pipeline = Pipeline(stages=[\n    document_assembler,\n    tokenizer,\n    word_embeddings\n])\n\n# 3. Create a Spark DataFrame and process it\ndata = spark.createDataFrame([[\"Spark NLP is a powerful library for natural language processing on Apache Spark.\"]]).toDF(\"text\")\n\n# Fit the pipeline to the data (this may download models on first use if not cached)\npipeline_model = nlp_pipeline.fit(data)\nresult = pipeline_model.transform(data)\n\n# 4. 
Show results\nprint(\"\\nPipeline Result:\")\n# The annotation's `embeddings` field holds the vectors; `result` holds the token text\nresult.select(\"token.result\", \"embeddings.embeddings\").show(truncate=False)\n\n# Example of LightPipeline for single-record inference\nlight_pipeline = LightPipeline(pipeline_model)\nlight_result = light_pipeline.annotate(\"Spark NLP makes NLP scalable and easy.\")\nprint(\"\\nLightPipeline Result (tokens):\", light_result['token'])\n\n# Stop the SparkSession when done\nspark.stop()","lang":"python","description":"This quickstart initializes Spark NLP, defines a basic NLP pipeline with document assembly, tokenization, and pre-trained word embeddings, and runs it over a Spark DataFrame. It also shows how to use `LightPipeline` for faster inference on single inputs. The `sparknlp.start()` function simplifies Spark session setup by loading the correct JARs and configuration."},"warnings":[{"fix":"Ensure your `pyspark` version and your Spark cluster's Scala version (e.g., 2.12 or 2.13) are compatible with the Spark NLP version. Pin a matching `pyspark` release and consult the official Spark NLP compatibility matrix (https://nlp.johnsnowlabs.com/docs/en/install#compatibility-matrix) for precise version pairings. For example, Spark NLP 6.x generally works with Spark 3.x-4.x.","message":"Spark NLP has strict compatibility requirements with Apache Spark and Scala versions. Mismatches can lead to `ClassNotFoundException`, `NoSuchMethodError`, or other runtime errors.","severity":"breaking","affected_versions":"All versions, especially when upgrading Spark NLP, PySpark, or Spark clusters."},{"fix":"Increase Spark driver and executor memory. 
You can set driver memory when starting the session: `spark = sparknlp.start(memory='16g')`, or configure `spark.driver.memory`, `spark.executor.memory`, and `spark.driver.maxResultSize` directly in your Spark configuration.","message":"Spark NLP operations, particularly model training or processing large documents/datasets, can be memory-intensive due to its reliance on the JVM. Default Spark/JVM memory settings may be insufficient, leading to `OutOfMemoryError`.","severity":"gotcha","affected_versions":"All versions."},{"fix":"Use `LightPipeline` for individual text strings or small lists (e.g., API endpoints, quick demos). Use the full `Pipeline` (fitting and transforming a Spark DataFrame) for production-scale batch processing, leveraging Spark's distribution capabilities.","message":"Choosing between `LightPipeline` and the full `Pipeline` matters for performance: `LightPipeline` is optimized for fast, single-record processing on the driver, while `Pipeline` is for large-scale, distributed batch processing within Spark. Misusing them can lead to performance bottlenecks.","severity":"gotcha","affected_versions":"All versions."},{"fix":"Prefer `sparknlp.start()` for initializing your SparkSession, as it simplifies dependency management and configuration. If manual configuration is necessary (e.g., specific cluster setups), ensure `spark.jars.packages` specifies the Spark NLP version matching your environment.","message":"Manually configuring the SparkSession to include Spark NLP JARs via `spark.jars.packages` is largely superseded by `sparknlp.start()`, which automates JAR resolution and handles versioning and compatibility.","severity":"deprecated","affected_versions":"Older versions (pre-4.x/5.x) and manual configurations. While manual setup still works, `sparknlp.start()` is the recommended approach for most users."}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}