{"library":"spark-nlp","title":"Spark NLP","type":"library","description":"John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML, providing performant and accurate NLP annotations for machine learning pipelines that scale in a distributed environment. It is currently at version 6.4.0 and releases new versions frequently, often multiple times a month, with a strong focus on LLM integration, multimodal document processing, and pipeline robustness.","language":"python","status":"active","last_verified":"Fri May 22","install":{"commands":["pip install spark-nlp pyspark","pip install spark-nlp[tensorflow]","pip install spark-nlp[databricks]"],"cli":null},"imports":["import sparknlp","from sparknlp.base import DocumentAssembler","from sparknlp.annotator import Tokenizer","from sparknlp.annotator import WordEmbeddingsModel","from pyspark.ml import Pipeline","from pyspark.sql import SparkSession","from sparknlp.base import LightPipeline"],"auth":{"required":false,"env_vars":[]},"links":{"homepage":"https://nlp.johnsnowlabs.com","github":"https://github.com/JohnSnowLabs/spark-nlp","docs":null,"changelog":null,"pypi":"https://pypi.org/project/spark-nlp/","npm":null,"openapi_spec":null,"status_page":null,"smithery":null},"quickstart":{"code":"import sparknlp\nfrom pyspark.ml import Pipeline\nfrom sparknlp.base import DocumentAssembler, LightPipeline\nfrom sparknlp.annotator import Tokenizer, WordEmbeddingsModel\n\n# 1. Initialize a SparkSession with Spark NLP\n# sparknlp.start() loads the Spark NLP JAR dependencies and configures Spark\n# Adjust the driver memory as needed for your environment\nspark = sparknlp.start(memory='16G')\n\n# 2. 
Define a simple Spark NLP pipeline\ndocument_assembler = DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\")\ntokenizer = Tokenizer().setInputCols([\"document\"]).setOutputCol(\"token\")\n\n# Load a pre-trained Word Embeddings model\n# (glove_100d is a small, general-purpose model suitable for quickstarts)\nword_embeddings = WordEmbeddingsModel.pretrained(\"glove_100d\", \"en\")\\\n    .setInputCols([\"document\", \"token\"])\\\n    .setOutputCol(\"embeddings\")\n\nnlp_pipeline = Pipeline(stages=[\n    document_assembler,\n    tokenizer,\n    word_embeddings\n])\n\n# 3. Create a Spark DataFrame and process it\ndata = spark.createDataFrame([[\"Spark NLP is a powerful library for natural language processing on Apache Spark.\"]]).toDF(\"text\")\n\n# Fit the pipeline to the data (this often involves downloading models if not cached)\npipeline_model = nlp_pipeline.fit(data)\nresult = pipeline_model.transform(data)\n\n# 4. Show results\nprint(\"\\nPipeline Result:\")\nresult.select(\"token.result\", \"embeddings.result\").show(truncate=False)\n\n# Example of LightPipeline for single-record inference\nlight_pipeline = LightPipeline(pipeline_model)\nlight_result = light_pipeline.annotate(\"Spark NLP makes NLP scalable and easy.\")\nprint(\"\\nLightPipeline Result (tokens):\", light_result['token'])\n\n# Don't forget to stop the SparkSession when done\nspark.stop()","lang":"python","description":"This quickstart demonstrates how to initialize Spark NLP, define a basic NLP pipeline with document assembly, tokenization, and pre-trained word embeddings, and process a Spark DataFrame. It also shows how to use `LightPipeline` for faster inference on single inputs. 
The `sparknlp.start()` function simplifies Spark session setup, ensuring correct JAR and configuration loading.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-22","installed_version":"6.4.0","pypi_latest":"6.4.0","is_stale":false,"summary":{"python_range":"3.9–3.13","success_rate":100,"avg_install_s":11.6,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"511.5M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"24.2M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"24.2M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":31.6,"import_time_s":null,"mem_mb":null,"disk_size":"512M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"25M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim 
(glibc)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":"25M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"518.0M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"26.5M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"26.5M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":29.4,"import_time_s":null,"mem_mb":null,"disk_size":"519M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":"27M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":2,"import_time_s":null,"mem_mb":null,"disk_size":"27M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine 
(musl)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"506.9M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"18.3M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"18.3M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":31.9,"import_time_s":null,"mem_mb":null,"disk_size":"507M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"19M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":"19M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"506.0M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine 
(musl)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"17.8M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"17.8M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":31,"import_time_s":null,"mem_mb":null,"disk_size":"506M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.7,"import_time_s":null,"mem_mb":null,"disk_size":"18M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"18M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"489.9M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"23.7M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine 
(musl)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"23.7M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"spark-nlp","exit_code":0,"wheel_type":"sdist","failure_reason":null,"import_side_effects":"broken","install_time_s":31,"import_time_s":null,"mem_mb":null,"disk_size":"490M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"databricks","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":2,"import_time_s":null,"mem_mb":null,"disk_size":"24M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"tensorflow","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":2.3,"import_time_s":null,"mem_mb":null,"disk_size":"24M"}]}}