Spark NLP

6.4.0 · active · verified Sun Apr 12

John Snow Labs' Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides performant, accurate NLP annotators for machine learning pipelines that scale in a distributed environment. The library is currently at version 6.4.0 and releases frequently, often multiple times a month, with a strong focus on LLM integration, multimodal document processing, and pipeline robustness.

Warnings

Install
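
A minimal install sketch for the quickstart below. Spark NLP is published on PyPI as `spark-nlp` and requires a compatible `pyspark`; the version pins shown here are illustrative assumptions, so check the official compatibility matrix for your Spark version.

```shell
# Install Spark NLP and PySpark (pin versions to match your cluster)
pip install spark-nlp==6.4.0 pyspark==3.4.1
```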

Imports

Quickstart

This quickstart demonstrates how to initialize Spark NLP, define a basic NLP pipeline with document assembly, tokenization, and pre-trained word embeddings, and process a Spark DataFrame. It also shows how to use `LightPipeline` for faster inference on single inputs. The `sparknlp.start()` function simplifies Spark session setup, ensuring correct JAR and configuration loading.

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

# 1. Initialize SparkSession with Spark NLP
# sparknlp.start() downloads the Spark NLP JAR (if needed) and configures the session.
# Adjust driver memory as needed for your environment.
spark = sparknlp.start(memory="16g")

# 2. Define a simple Spark NLP pipeline
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Load a pre-trained Word Embeddings model
# (glove_100d is a small, general-purpose model suitable for quickstarts)
word_embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings
])

# 3. Create a Spark DataFrame and process it
data = spark.createDataFrame([["Spark NLP is a powerful library for natural language processing on Apache Spark."]]).toDF("text")

# Fit the pipeline to the data (this often involves downloading models if not cached)
pipeline_model = nlp_pipeline.fit(data)
result = pipeline_model.transform(data)

# 4. Show results
# "token.result" holds the token strings; the vectors live in "embeddings.embeddings"
print("\nPipeline Result:")
result.select("token.result", "embeddings.embeddings").show(truncate=80)

# Example of LightPipeline for single-record inference
light_pipeline = LightPipeline(pipeline_model)
light_result = light_pipeline.annotate("Spark NLP makes NLP scalable and easy.")
print("\nLightPipeline Result (tokens):", light_result['token'])

# Don't forget to stop the SparkSession when done
spark.stop()
