Transformers Stream Generator
A text-generation method built on Hugging Face `transformers` that returns a generator, streaming out each token in real time during inference. It provides a simple way to enable token-by-token streaming for Hugging Face `transformers` models, commonly used with large language models (LLMs). The library is currently at version 0.0.5 and is in an early development stage, with updates released as features and fixes are integrated.
Common errors
- `ImportError: cannot import name 'BeamSearchScorer' from 'transformers' (unknown location)`
  - Cause: This error typically indicates an incompatibility between `transformers-stream-generator` and your installed version of Hugging Face `transformers`. The location or signature of the `BeamSearchScorer` class may have changed in a newer `transformers` release, or an older `transformers-stream-generator` may not be compatible with your `transformers` version.
  - Fix: Update both `transformers-stream-generator` and `transformers` to their latest versions. If the issue persists, try pinning `transformers` to a version known to be compatible (e.g., `pip install transformers==4.30.0`) and test again.
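When this ImportError appears, it can help to check programmatically whether your installed `transformers` still exposes `BeamSearchScorer` at the top level before debugging further. A minimal sketch; the helper name is illustrative and not part of either library:

```python
import importlib.util


def has_beam_search_scorer() -> bool:
    """Return True if the installed transformers package exposes
    BeamSearchScorer at the top level (illustrative helper)."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    import transformers
    return hasattr(transformers, "BeamSearchScorer")


print(has_beam_search_scorer())
```

If this prints `False` on a machine where `transformers` is installed, the class has moved or been removed in your release, which matches the incompatibility described above.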
- No streaming output / generator does not yield tokens
  - Cause: The generator may not yield tokens if the required generation parameters are not set. Common causes include omitting `do_sample=True` when `do_stream=True` is used, or setting `num_beams` to a value other than 1.
  - Fix: Verify that your `model.generate` call includes `do_stream=True`, `do_sample=True`, and `num_beams=1`.
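These parameter requirements can be checked before calling `generate`. The following sketch is a hypothetical helper (not part of the library's API) that flags the two most common misconfigurations:

```python
def check_stream_kwargs(kwargs):
    """Return a list of problems with generate() kwargs that commonly
    prevent streaming output (illustrative helper, not library API)."""
    problems = []
    if kwargs.get("do_stream"):
        if not kwargs.get("do_sample"):
            problems.append("do_stream=True requires do_sample=True")
        if kwargs.get("num_beams", 1) != 1:
            problems.append("streaming expects num_beams=1")
    return problems


# A misconfigured call is flagged; a correct one yields no problems.
print(check_stream_kwargs({"do_stream": True, "num_beams": 4}))
print(check_stream_kwargs({"do_stream": True, "do_sample": True, "num_beams": 1}))
```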
- `ERROR: Could not build wheels for transformers-stream-generator, which is required to install pyproject.toml-based projects`
  - Cause: This build error often occurs when build-time dependencies such as `wheel` or `setuptools` are missing or outdated, or when `pip` itself is an older version. It is common in environments where Python packages are installed without proper build tools.
  - Fix: Upgrade `pip` to the latest version (`python -m pip install --upgrade pip`) and ensure `wheel` and `setuptools` are installed: `pip install wheel setuptools`.
Warnings
- Deprecated: The library modifies the pretrained model configuration directly to control generation, a strategy Hugging Face `transformers` considers deprecated. This approach may break with future versions of the `transformers` library.
- Gotcha: For `do_stream=True` to work correctly, `do_sample=True` must also be set in the `model.generate` call; omitting it can produce non-streaming output or unexpected behavior.
- Gotcha: Streaming generation may not work as expected, or at all, if `num_beams` is set to a value greater than 1 (i.e., when using beam search).
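One way to avoid the two gotchas above is to centralize the streaming-safe defaults and merge user overrides on top. A hypothetical sketch: only `do_stream`, `do_sample`, and `num_beams` come from this document; the helper itself is an assumption, not library API:

```python
def stream_generate_kwargs(**overrides):
    """Build kwargs for model.generate() with the streaming-safe defaults
    described above (illustrative helper, not part of the library)."""
    kwargs = {"do_stream": True, "do_sample": True, "num_beams": 1}
    kwargs.update(overrides)  # caller overrides win, so misuse is still possible
    return kwargs


print(stream_generate_kwargs(temperature=0.7, max_new_tokens=50))
```

The resulting dict can be splatted into the call, e.g. `model.generate(input_ids, **stream_generate_kwargs(temperature=0.7))`.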
Install
```shell
pip install transformers-stream-generator
```
Imports
- `init_stream_support`

```python
from transformers_stream_generator import init_stream_support
```
Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_stream_generator import init_stream_support
import os

# Initialize streaming support (patches transformers' generate())
init_stream_support()

# Load model and tokenizer (e.g., a small GPT-2 for demonstration);
# replace with your desired model
model_name = os.environ.get('TRANSFORMERS_MODEL', 'gpt2')
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Encode input
input_text = "Hello, I am a language model and I can"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text with streaming enabled.
# do_stream=True requires do_sample=True and typically num_beams=1.
print(f"Generating with {model_name} in streaming mode...")
generator = model.generate(
    input_ids,
    max_new_tokens=50,
    do_stream=True,
    do_sample=True,   # required for do_stream=True
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    num_beams=1,      # streaming generally works best with num_beams=1
)

# Iterate and print tokens as they are generated
print(input_text, end="")
for token_id in generator:
    word = tokenizer.decode(token_id, skip_special_tokens=True)
    print(word, end="", flush=True)

print("\n\nGeneration complete.")
```