TensorFlow Text
TensorFlow Text is a library providing text-related operations, modules, and subgraphs for TensorFlow. It facilitates common text preprocessing tasks required by text-based models and offers features useful for sequence modeling not found in core TensorFlow. The library is actively maintained and typically releases new versions in lockstep with major and minor TensorFlow releases.
Warnings
- breaking TensorFlow Text versions are tightly coupled with TensorFlow versions. Installing a `tensorflow-text` release whose major.minor version differs from that of your installed `tensorflow` (for example, `tensorflow-text` 2.16.x with `tensorflow` 2.15.x) can lead to import errors or runtime issues.
- gotcha After TensorFlow Text version 2.10, pre-built pip packages are only provided for Linux x86_64 and Intel-based Macs. Users on other platforms (e.g., Windows, Aarch64, Apple Silicon Macs) may need to build from source.
- gotcha Older versions of `FastWordpieceTokenizer` and `WhitespaceTokenizer` contained memory safety bugs (e.g., concerning `StringVocab` lifetime or out-of-bounds reads).
- gotcha Some text operations in older versions had input size limitations (e.g., using `int16_t`), which could cause issues with large inputs.
- gotcha Punctuation definition mismatches between different Unicode versions were observed in earlier releases, potentially leading to inconsistent tokenization.
- deprecated The `use_unique_shared_resource_name` option was removed in version 2.16.1. Code relying on this option will break.
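The version-coupling warning above can be checked before importing either library. The following is a minimal sketch, not an official tool: the helper names `major_minor`, `versions_compatible`, and `installed_version` are hypothetical, and it assumes the packages are installed under the distribution names `tensorflow` and `tensorflow-text` (variants such as `tensorflow-cpu` would need a different lookup).

```python
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.16.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(tf_version: str, text_version: str) -> bool:
    """True when both version strings share the same major.minor pair."""
    return major_minor(tf_version) == major_minor(text_version)


def installed_version(package: str) -> Optional[str]:
    """Look up an installed distribution's version, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; adjust if a variant build is installed.
    tf_v = installed_version("tensorflow")
    text_v = installed_version("tensorflow-text")
    if tf_v is None or text_v is None:
        print("tensorflow and/or tensorflow-text is not installed")
    elif versions_compatible(tf_v, text_v):
        print(f"OK: tensorflow {tf_v} matches tensorflow-text {text_v}")
    else:
        print(f"Mismatch: tensorflow {tf_v} vs tensorflow-text {text_v}")
```

Reading the versions from package metadata avoids importing `tensorflow_text` itself, which is the step that may fail when the versions are mismatched.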
Install
- pip install tensorflow-text==2.20.1
- pip install -U tensorflow-text
Imports
- tensorflow_text
import tensorflow_text as tf_text
Quickstart
import tensorflow as tf
import tensorflow_text as tf_text
# Create a WhitespaceTokenizer
tokenizer = tf_text.WhitespaceTokenizer()
# Input text as a TensorFlow tensor
text_tensor = tf.constant(["Hello TensorFlow Text!", "This is a great library."])
# Tokenize the text
tokens = tokenizer.tokenize(text_tensor)
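# Print the tokens. The result is a RaggedTensor because each sentence
# yields a different number of tokens; to_list() gives nested Python lists.
print("Original text:", text_tensor.numpy())
print("Tokenized text:", tokens.to_list())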
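To see why the Quickstart result is ragged, the sketch below mimics the output shape in plain Python. It is only an illustration under stated assumptions: `whitespace_tokenize` is a hypothetical helper built on `str.split()`, not the library's implementation, and the library returns byte strings inside a `RaggedTensor` rather than Python lists of `str`.

```python
from typing import List


def whitespace_tokenize(sentences: List[str]) -> List[List[str]]:
    """Split each sentence on whitespace, giving one token list per sentence."""
    return [sentence.split() for sentence in sentences]


sentences = ["Hello TensorFlow Text!", "This is a great library."]
tokens = whitespace_tokenize(sentences)

# Rows have different lengths, which is why a ragged structure is needed.
row_lengths = [len(row) for row in tokens]

print(tokens)       # [['Hello', 'TensorFlow', 'Text!'], ['This', 'is', 'a', 'great', 'library.']]
print(row_lengths)  # [3, 5]
```

Punctuation stays attached to the neighboring word ("Text!", "library.") because splitting happens on whitespace only.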