Transformer Smaller Training Vocab

0.4.2 · active · verified Thu Apr 16

transformer-smaller-training-vocab is a Python library designed to optimize transformer model training by temporarily reducing the vocabulary size to only include tokens present in the training dataset. This process helps save RAM and speeds up training, especially for domain-specific fine-tuning. The library is currently at version 0.4.2 and has an active, though infrequent, release cadence, with recent updates addressing compatibility and bug fixes.

Install
pip install transformer-smaller-training-vocab

Imports
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformer_smaller_training_vocab import get_texts_from_dataset, reduce_train_vocab

Quickstart

This quickstart demonstrates `reduce_train_vocab`, which the library exposes as a context manager: on entry it shrinks the model and tokenizer vocabulary to the tokens that actually occur in the provided training texts, and on exit it restores the original vocabulary and embedding weights, so the trained model can be saved and used without this library.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformer_smaller_training_vocab import get_texts_from_dataset, reduce_train_vocab

# 1. Load a pre-trained model and tokenizer
model_name = "bert-base-uncased"  # a small, common model for the quickstart
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Gather the raw training texts the reduced vocabulary will be built from
raw_datasets = load_dataset("glue", "sst2")
texts = get_texts_from_dataset(raw_datasets["train"], key="sentence")

print(f"Original vocab size: {len(tokenizer)}")

# 3. Reduce the vocabulary for the duration of the with-block
with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=texts):
    print(f"Reduced vocab size: {len(tokenizer)}")
    # Train with `model` and `tokenizer` here, e.g. via transformers.Trainer

# 4. The original vocabulary and embeddings are restored on exit
print(f"Restored vocab size: {len(tokenizer)}")
# Save the model now; it no longer depends on this library
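To make the idea concrete, here is a minimal sketch of the core technique, using only NumPy. It is an illustration under simplified assumptions, not the library's actual implementation (the real library also handles special tokens, tied input/output embeddings, and restoring the full vocabulary afterwards): keep only the embedding rows for token ids that occur in the training data, and remap ids to the compacted range.

```python
import numpy as np

def reduce_embeddings(embedding, used_token_ids):
    """Keep only the rows of `embedding` whose token ids appear in
    `used_token_ids`; return the reduced matrix and an old->new id map.
    (Simplified, hypothetical sketch of the vocabulary-reduction idea.)"""
    kept = sorted(set(used_token_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    reduced = embedding[kept]
    return reduced, old_to_new

# Toy vocabulary of 10 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
embedding = rng.normal(size=(10, 4))

# Suppose the training corpus only ever uses tokens 1, 3, and 7
reduced, old_to_new = reduce_embeddings(embedding, [1, 3, 3, 7])

print(reduced.shape)   # (3, 4): only 3 of 10 rows are kept
print(old_to_new)      # {1: 0, 3: 1, 7: 2}
assert np.array_equal(reduced[old_to_new[7]], embedding[7])
```

During training, gradients only flow through the kept rows, which is where the memory and speed savings come from; the remap table is what lets the original vocabulary be restored afterwards.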
