Transformer Smaller Training Vocab
transformer-smaller-training-vocab is a Python library designed to optimize transformer model training by temporarily reducing the vocabulary size to only include tokens present in the training dataset. This process helps save RAM and speeds up training, especially for domain-specific fine-tuning. The library is currently at version 0.4.2 and has an active, though infrequent, release cadence, with recent updates addressing compatibility and bug fixes.
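The core mechanism can be sketched without the library: collect the set of token ids that actually occur in the training data, keep only those rows of the embedding matrix, and re-encode sequences with compact ids. The sketch below is a toy illustration in plain Python (lists instead of tensors); it is not the library's implementation.

```python
# Toy "embedding" matrix: full vocab of 10 tokens, hidden size 4.
full_vocab, hidden = 10, 4
embedding = [[float(i)] * hidden for i in range(full_vocab)]

# Token ids that actually appear in the training data (toy example).
train_token_ids = sorted({1, 5, 7, 9})

# Map original ids -> compact ids, and keep only the needed embedding rows.
old_to_new = {old: new for new, old in enumerate(train_token_ids)}
reduced = [embedding[i] for i in train_token_ids]

# Re-encode a training sequence with the compact ids.
seq = [5, 1, 9]
reduced_seq = [old_to_new[t] for t in seq]

print(len(reduced), reduced_seq)  # 4 rows instead of 10; seq becomes [1, 0, 3]
```

After training, the reverse mapping restores the reduced rows into a full-size embedding matrix, which is what makes the reduction temporary.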
Common errors
- `RuntimeError: The vocab_size of the new embedding is not correct. Please open an issue on github!`
  - cause: The vocabulary-recreation step (`recreate_vocab`) calculated an incorrect vocabulary size, producing a mismatch when the model's embeddings were updated.
  - fix: This was a known bug fixed in version 0.3.3. Upgrade to `transformer-smaller-training-vocab>=0.3.3`.
- `Package 'transformer-smaller-training-vocab' requires Python >=3.9, <4.0 but the running Python is 3.8.X`
  - cause: You are attempting to install or run a recent version of the library (>=0.4.1) on an unsupported Python 3.8 environment.
  - fix: Upgrade your Python installation to version 3.9 or newer, e.g. create a virtual environment with `python3.9 -m venv .venv`.
- `AttributeError: 'AddedToken' object has no attribute 'content'`
  - cause: This and similar `AttributeError`s involving `AddedToken` objects usually point to mishandled special tokens, particularly after vocabulary modification or reduction.
  - fix: This class of issue was addressed in version 0.2.3 and further refined in 0.4.1. Upgrade to `transformer-smaller-training-vocab>=0.4.1`.
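The Python-version error above only surfaces at install time. A small runtime guard can surface the same constraint earlier; `check_python_for_tsv` below is a hypothetical helper name (not library API), and the `(3, 9)` bound mirrors the library's `>=3.9` requirement.

```python
import sys

def check_python_for_tsv(version_info=sys.version_info):
    """Return True if the interpreter satisfies the Python >=3.9
    requirement of transformer-smaller-training-vocab >= 0.4.1."""
    return tuple(version_info[:2]) >= (3, 9)

print(check_python_for_tsv((3, 8, 10)))  # False: matches the error above
print(check_python_for_tsv((3, 11, 0)))  # True
```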
Warnings
- breaking: Python 3.8 support was dropped in version 0.4.1. Users on Python 3.8 will encounter installation errors or runtime issues with newer versions.
- gotcha: Older versions (prior to 0.4.1 and 0.2.3) mishandled special tokens, which could lead to unexpected tokenizer behavior or errors during vocabulary reduction/recreation.
- gotcha: Prior to version 0.3.3, a bug meant the vocabulary size was not set correctly when recreating the full embedding after reduction, potentially leading to dimension-mismatch errors.
- gotcha: The `datasets` library, initially a direct dependency, became an optional dependency in version 0.3.0. If you rely on the `datasets` integration, ensure it is installed separately.
Install
```shell
pip install transformer-smaller-training-vocab
```
Imports
- `reduce_train_vocab_and_context`:
  `from transformer_smaller_training_vocab import reduce_train_vocab_and_context`
- `recreate_vocab`:
  `from transformer_smaller_training_vocab import recreate_vocab`
Quickstart
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformer_smaller_training_vocab import reduce_train_vocab_and_context, recreate_vocab

# 1. Load a pre-trained model and tokenizer
model_name = "bert-base-uncased"  # a small, common model for the quickstart
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Dummy dataset for demonstration
texts = ["hello world", "this is a test", "another example sentence"]
tokenized_texts = [tokenizer(text, return_tensors="pt") for text in texts]

# 2. Reduce the vocabulary to the tokens that occur in the training data
reduced_tokenizer, reduced_model, added_tokens_during_reduction = reduce_train_vocab_and_context(
    model=model,
    tokenizer=tokenizer,
    tokenized_datasets=[t["input_ids"] for t in tokenized_texts],
    model_resize_strategy="embedding_resize",
)
print(f"Original vocab size: {tokenizer.vocab_size}")
print(f"Reduced vocab size: {reduced_tokenizer.vocab_size}")

# Train with reduced_tokenizer and reduced_model here.

# 3. Recreate the original vocabulary (after training, if needed)
recreated_model = recreate_vocab(
    reduced_model=reduced_model,
    reduced_tokenizer=reduced_tokenizer,
    orig_tokenizer=tokenizer,
    added_tokens_during_reduction=added_tokens_during_reduction,
    model_resize_strategy="embedding_resize",
)
print(f"Recreated model vocab size: {recreated_model.config.vocab_size}")
# Verify recreated_model.config.vocab_size == tokenizer.vocab_size
```
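To see where the RAM saving comes from, note that the embedding matrix scales linearly with vocabulary size. A back-of-the-envelope calculation, assuming float32 weights and BERT-base dimensions (`embedding_bytes` is an illustrative helper, and the reduced vocabulary of 2,000 tokens is an assumed, task-dependent figure):

```python
def embedding_bytes(vocab_size, hidden_size, bytes_per_param=4):
    """Memory used by a float32 embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size * bytes_per_param

full = embedding_bytes(30_522, 768)    # bert-base-uncased full vocabulary
reduced = embedding_bytes(2_000, 768)  # hypothetical task-specific vocabulary

print(f"full: {full / 2**20:.1f} MiB, reduced: {reduced / 2**20:.1f} MiB")
# full: 89.4 MiB, reduced: 5.9 MiB
```

The weights saved this way matter most on small, domain-specific fine-tuning corpora, where only a fraction of the pre-trained vocabulary ever appears; optimizer state (e.g. Adam's two moments per parameter) shrinks proportionally as well.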