Spark NLP
John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML, providing performant and accurate NLP annotations for machine learning pipelines that scale in a distributed environment. The project is currently at version 6.4.0 and releases frequently, often several times a month, with a strong focus on LLM integration, multimodal document processing, and pipeline robustness.
Warnings
- breaking Spark NLP has strict compatibility requirements with Apache Spark and Scala versions. Mismatches can lead to `ClassNotFoundException`, `NoSuchMethodError`, or other runtime errors.
- gotcha Spark NLP operations, particularly model training or processing large documents/datasets, can be memory-intensive due to its reliance on the JVM. Default Spark/JVM memory settings may be insufficient, leading to `OutOfMemoryError`.
- gotcha Choose carefully between `LightPipeline` and the full `Pipeline` for inference. `LightPipeline` runs entirely on the driver, bypassing Spark's distributed execution, and is optimized for fast single-record or small-batch inference; `Pipeline.transform` distributes work across the cluster and is intended for large-scale batch processing. Using the wrong one can create severe performance bottlenecks.
- deprecated Manually configuring SparkSession to include Spark NLP JARs using `spark.jars.packages` is largely superseded by `sparknlp.start()`, which automates this process and handles versioning and compatibility.
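The memory and configuration warnings above can be illustrated with a short configuration sketch. This assumes a local PySpark environment; the exact package coordinates in the commented-out legacy setup are placeholders, not pinned values:

```python
import sparknlp

# sparknlp.start() resolves and downloads the Spark NLP JAR matching the
# installed PySpark/Scala versions and returns a configured SparkSession.
# The `memory` argument sets the driver memory, which is the usual first
# fix for JVM OutOfMemoryError during training or large-document processing.
spark = sparknlp.start(memory="16G")

# The superseded manual setup looked roughly like this; the Scala suffix and
# version had to be kept in sync by hand, which is what sparknlp.start() avoids:
#
# from pyspark.sql import SparkSession
# spark = SparkSession.builder \
#     .appName("Spark NLP") \
#     .config("spark.driver.memory", "16g") \
#     .config("spark.jars.packages",
#             "com.johnsnowlabs.nlp:spark-nlp_2.12:<version>") \
#     .getOrCreate()
```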
Install
- pip install spark-nlp pyspark
- pip install spark-nlp[tensorflow]
- pip install spark-nlp[databricks]
Imports
- sparknlp
import sparknlp
- DocumentAssembler
from sparknlp.base import DocumentAssembler
- Tokenizer
from sparknlp.annotator import Tokenizer
- WordEmbeddingsModel
from sparknlp.annotator import WordEmbeddingsModel
- Pipeline
from pyspark.ml import Pipeline
- SparkSession
from pyspark.sql import SparkSession
- LightPipeline
from sparknlp.base import LightPipeline
Quickstart
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from pyspark.ml import Pipeline
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel
# 1. Initialize SparkSession with Spark NLP
# Automatically downloads the matching Spark NLP JARs and configures the session
# `memory` sets the Spark driver memory; adjust it for your environment
spark = sparknlp.start(memory="16g")
# 2. Define a simple Spark NLP pipeline
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
# Load a pre-trained Word Embeddings model
# (glove_100d is a small, general-purpose model suitable for quickstarts)
word_embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
word_embeddings
])
# 3. Create a Spark DataFrame and process it
data = spark.createDataFrame([["Spark NLP is a powerful library for natural language processing on Apache Spark."]]).toDF("text")
# Fit the pipeline to the data (pretrained models are downloaded on first use and cached locally)
pipeline_model = nlp_pipeline.fit(data)
result = pipeline_model.transform(data)
# 4. Show results
print("\nPipeline Result:")
# `embeddings.embeddings` holds the vectors; `embeddings.result` only echoes the token text
result.select("token.result", "embeddings.embeddings").show(truncate=80)
# Example of LightPipeline for single-record inference
light_pipeline = LightPipeline(pipeline_model)
light_result = light_pipeline.annotate("Spark NLP makes NLP scalable and easy.")
print("\nLightPipeline Result (tokens):", light_result['token'])
# Don't forget to stop the SparkSession when done
spark.stop()