LangChain Text Splitters
LangChain Text Splitters (current version 1.1.1) provides a comprehensive set of utilities for breaking down large text documents into smaller, manageable chunks. This is crucial for applications like Retrieval-Augmented Generation (RAG) and fitting content within Language Model context windows. As an integral part of the LangChain ecosystem, it maintains an active and rapid release cadence, closely aligned with other LangChain libraries.
Warnings
- breaking The text splitter modules have been moved from `langchain.text_splitter` to the standalone `langchain-text-splitters` package. Direct imports from `langchain.text_splitter` will no longer work.
- gotcha The `create_documents()` method expects a *list* of strings. Passing a single string instead causes it to be iterated character by character, treating each character as a separate document. (To split existing `Document` objects, use `split_documents()`.)
- gotcha The `chunk_size` parameter for character-based splitters is a *target* maximum, not a hard limit. Because the splitter prefers to break on specific separators first, actual chunks are often shorter, and can even exceed `chunk_size` when no suitable separator is found within the limit.
- gotcha Mixing major versions of LangChain ecosystem packages (e.g., `langchain-text-splitters==1.x.x` with `langchain-core==0.3.x`) can lead to compatibility issues and unexpected behavior.
- gotcha Some specialized splitters, like `MarkdownHeaderTextSplitter` and `HTMLHeaderTextSplitter`, do not inherit from the base `TextSplitter` class, so their method signatures differ: for example, their `split_text` takes a single string and returns `Document` objects rather than plain strings.
Install
-
pip install langchain-text-splitters
Imports
- RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
- CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
- MarkdownHeaderTextSplitter
from langchain_text_splitters import MarkdownHeaderTextSplitter
Quickstart
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Example long text
long_text = (
    "LangChain is a framework designed to simplify the creation of applications using large language models. "
    "It provides tools for chaining together different components, making it easier to build complex LLM workflows. "
    "Text splitting is a fundamental step in processing long documents for LLMs, ensuring that chunks fit within context windows and maintain semantic coherence. "
    "The RecursiveCharacterTextSplitter is often the recommended default for general-purpose text."
)
# Initialize the splitter
# chunk_size: maximum size of each chunk (in characters by default)
# chunk_overlap: number of characters to overlap between consecutive chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
# Split the text
chunks = text_splitter.split_text(long_text)
# Print the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")