SetFit
SetFit is a Python library for efficient few-shot learning with Sentence Transformers. It trains accurate text classifiers from minimal labeled data by fine-tuning pre-trained Sentence Transformer models. The library is prompt-free, fast to train, and supports multilingual models. The current version is 1.1.3, and the project maintains an active release cadence: frequent patch releases address compatibility and minor fixes, alongside larger feature releases.
Common errors
- AttributeError: 'SetFitModel' object has no attribute 'fit'
  cause: Attempting to use the `model.fit()` method, which was removed in SetFit v1.0.0.
  fix: The training API was refactored. Use `Trainer` (formerly `SetFitTrainer`) with a `TrainingArguments` instance and call `trainer.train()` instead.
- TypeError: SetFitTrainer.__init__() got an unexpected keyword argument 'num_epochs' (or 'batch_size', 'learning_rate', etc.)
  cause: Training hyperparameters (e.g., `num_epochs`, `batch_size`, `learning_rate`) were passed directly to the trainer's constructor, but in v1.0.0 and later they must be encapsulated in a `TrainingArguments` object.
  fix: Create a `TrainingArguments` instance with these parameters and pass it to the trainer via the `args` parameter.
- RuntimeError: [enforce fail at inline_container.cc:471] PytorchStreamWriter failed writing file data/...
  cause: This error during `trainer.train()` often indicates a corrupted PyTorch cache or insufficient disk space.
  fix: Clear the Sentence Transformers cache (`~/.cache/torch/sentence_transformers/`) and ensure your system has enough free disk space.
- Model outputs the exact same label regardless of input, or accuracy is very low (e.g., 60-63%) on a classification task
  cause: This can indicate overfitting to the majority class, a severely imbalanced dataset, or 'collapse' of the embedding model, where it fails to distinguish between classes.
  fix: Inspect the class distribution of your dataset and balance it if necessary. Reduce `num_epochs` or `num_iterations`, adjust the learning rate, or use techniques like early stopping. Evaluate with metrics appropriate for imbalanced data (e.g., F1-score, precision, recall).
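As a first diagnostic for the imbalance/collapse failure mode, the class distribution can be inspected with plain Python before any training. A minimal sketch (the helper name is illustrative, not part of SetFit):

```python
from collections import Counter

def label_distribution(labels):
    """Return per-class counts and the majority-class share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return dict(counts), max(counts.values()) / total

# A majority share far above 1/num_classes suggests rebalancing before training,
# or at least reporting F1/precision/recall instead of raw accuracy.
counts, majority_share = label_distribution([0, 0, 0, 0, 0, 0, 1, 1])
```

With the toy labels above, the majority class holds 75% of the examples, so 75% accuracy is achievable by always predicting class 0; accuracy alone would mask the collapse.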
Warnings
- breaking SetFit v1.0.0 introduced significant API changes, particularly around the trainer and how training arguments are handled. The old `model.fit()` method was removed, `SetFitTrainer` was deprecated in favor of `Trainer`, and training parameters were moved into a `TrainingArguments` dataclass.
- deprecated Python 3.7 support was deprecated starting from SetFit v1.1.0.
- gotcha Frequent compatibility issues can arise with newer versions of core dependencies like `datasets`, `transformers`, and `sentence-transformers`. Patch releases of SetFit (`v1.1.3`, `v1.1.2`) often address these specific issues.
- gotcha Reproducibility of trained models across different SetFit versions (e.g., v0.6.0 vs v1.0.3) can be challenging due to changes in random seed handling for the model's head initialization and different sentence pair sampling methods.
- gotcha The `report_to="none"` argument in `TrainingArguments` was sometimes ignored, leading to unexpected logging behavior if Weights & Biases or Tensorboard were installed.
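Given the dependency-compatibility gotchas above, it helps to record the installed versions of the core stack before upgrading or filing a bug. A stdlib-only sketch (the function name is hypothetical, not a SetFit utility):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version string, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report

# Typical usage for a SetFit environment:
# installed_versions(["setfit", "transformers", "datasets", "sentence-transformers"])
report = installed_versions(["definitely-not-a-real-package-xyz"])
```

Missing packages come back as `None` rather than raising, so the whole report can be pasted into an issue in one go.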
Install
- pip install setfit
- pip install torch --index-url https://download.pytorch.org/whl/cu118 && pip install setfit
Imports
- SetFitModel
from setfit import SetFitModel
- Trainer (v1.0.0+) / SetFitTrainer (deprecated alias)
from setfit import Trainer
from setfit import SetFitTrainer
- TrainingArguments
from setfit import TrainingArguments
- SupConLoss
from setfit.modeling import SupConLoss
from setfit import SupConLoss
Quickstart
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

# 1. Initialize a SetFit model from a pre-trained Sentence Transformer
model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5")

# 2. Load and prepare a dataset (e.g., sst2 for sentiment classification)
dataset = load_dataset("SetFit/sst2")
# Simulate the few-shot regime: 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"]

# 3. Define TrainingArguments (v1.0.0+: hyperparameters no longer go to the trainer)
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,  # epochs of contrastive fine-tuning of the embedding body
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

# 4. Create the Trainer (`SetFitTrainer` is a deprecated alias since v1.0.0).
# The contrastive loss defaults to CosineSimilarityLoss.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    # Map dataset columns to the expected 'text' and 'label' names
    column_mapping={"sentence": "text", "label": "label"},
)

# 5. Train the model
trainer.train()

# 6. Evaluate the model
metrics = trainer.evaluate()
print(f"Evaluation metrics: {metrics}")

# 7. Make predictions
sentences = ["The movie was great!", "I didn't like the food."]
predictions = model.predict(sentences)
print(f"Predictions: {predictions}")

# 8. Push the model to the Hugging Face Hub (requires `huggingface_hub` login)
# trainer.push_to_hub("my-awesome-setfit-model")
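The contrastive stage of training fine-tunes the embedding body on sentence pairs: pairs sharing a label are positives, pairs with different labels are negatives. A simplified pure-Python sketch of that pair generation (not SetFit's actual sampler, which also supports configurable sampling strategies):

```python
import random

def sample_contrastive_pairs(texts, labels, num_pairs, seed=42):
    """Generate (text_a, text_b, target) triples: target 1.0 when the two
    texts share a label (positive pair), 0.0 otherwise (negative pair)."""
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    pairs = []
    while len(pairs) < num_pairs:
        i, j = rng.sample(indices, 2)  # two distinct examples
        pairs.append((texts[i], texts[j], 1.0 if labels[i] == labels[j] else 0.0))
    return pairs

texts = ["great movie", "loved it", "terrible food", "awful service"]
labels = [1, 1, 0, 0]
pairs = sample_contrastive_pairs(texts, labels, num_pairs=6)
```

A cosine-similarity loss then pulls the embeddings of positive pairs together (similarity toward 1.0) and pushes negative pairs apart, which is why even 8 examples per class can yield many informative training pairs.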