Floret Python Bindings
Floret is an actively maintained Python library by Explosion (the makers of spaCy) that extends fastText to produce compact, full-coverage word vectors using Bloom embeddings. It substantially reduces the size of the vector table while maintaining performance, which is particularly useful for morphologically rich languages and for out-of-vocabulary words. The current version is 0.10.5; releases are driven mainly by support for new Python versions and by new training features.
Common errors
- ModuleNotFoundError: No module named 'floret'
  cause: The 'floret' library is not installed in the current Python environment or the environment is not activated.
  fix: Ensure you have activated the correct virtual environment if using one, then run `pip install floret`.
- AttributeError: 'Model' object has no attribute 'some_method'
  cause: You are attempting to call a method that does not exist on the `floret` model object, or you have a typo. This might also occur if you are expecting a fastText-specific method that `floret` does not expose or re-implement.
  fix: Consult the `floret` documentation or its GitHub repository for the correct API. Common methods include `get_word_vector`, `get_word_id`, `save_model`, `save_vectors`, and `save_floret_vectors`.
- TypeError: train_supervised() got an unexpected keyword argument 'hashCount'
  cause: The `hashCount` (and `mode`) arguments for `floret.train_supervised` were added in version 0.10.4. This error indicates you are using an older version of the `floret` library.
  fix: Upgrade your `floret` installation to the latest version: `pip install --upgrade floret`.
- ValueError: Must pass a file path for training data
  cause: Training functions like `train_unsupervised` and `train_supervised` expect a string path to a file containing training data, not raw text or a file-like object.
  fix: Write your training data to a text file (e.g., `data.txt`) and pass the path to this file to the training function, for example: `floret.train_unsupervised("data.txt", ...)`.
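When the corpus lives in memory rather than on disk, the last fix above can be wrapped in a small helper. The following is a minimal sketch, not part of floret: `write_corpus` is a hypothetical name, and the training call is left as a comment because it assumes floret is installed.

```python
import os
import tempfile


def write_corpus(lines, directory=None):
    """Write an iterable of text lines to a UTF-8 file and return its path.

    Hypothetical helper (not part of floret): the training functions take a
    file path, so in-memory text has to be written to disk first.
    """
    fd, path = tempfile.mkstemp(suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for line in lines:
            # One training example per line; collapse embedded line breaks.
            f.write(" ".join(line.split()) + "\n")
    return path


corpus_path = write_corpus([
    "This is a sample sentence.",
    "Text with an embedded\nline break.",
])

# With floret installed, pass the path (not the text) to the trainer:
# model = floret.train_unsupervised(corpus_path, ...)
```

The caller is responsible for deleting the file (for example with `os.remove(corpus_path)`) once training has finished.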
Warnings
- breaking The binary formats (`.bin` files) saved by `floret` are not compatible with binary models saved by original `fastText` and vice-versa.
- gotcha By default, `floret.train_unsupervised()` and `floret.train_supervised()` use `mode='fasttext'`, which trains and saves original fastText vectors. To leverage floret's Bloom embeddings for compact vectors, you must explicitly set `mode='floret'` during training.
- gotcha It is not currently possible to train floret models iteratively or from pre-trained embeddings directly through the Python API.
- gotcha When integrating floret vectors into spaCy, certain spaCy `Token` attributes and `Vocab` methods behave differently due to the subword embedding nature. Specifically, `token.is_oov` will always be `False` and `nlp.vocab.vectors.most_similar` might not be supported or could throw an error.
- gotcha The method `model.save_model("file.bin")` saves the full floret model, which can be large. To get the highly compact floret vector table for use in applications like spaCy, a separate method `model.save_floret_vectors("file.floret")` is provided.
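The idea behind `mode='floret'` can be illustrated with a toy lookup: words and their character n-grams share one table of `bucket` rows, and each entry is hashed into `hashCount` of those rows. The sketch below is an illustration under stated assumptions, not floret's implementation: BLAKE2 stands in for the real hash function, the rows are combined by a plain sum, and all function names are hypothetical.

```python
import hashlib


def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = [wrapped]  # the full word shares the table with its subwords
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            gram = wrapped[i:i + n]
            if gram != wrapped:
                grams.append(gram)
    return grams


def rows_for(entry, bucket, hash_count):
    """Map one entry to hash_count row indices using seeded hashes.

    Assumption: BLAKE2 is a stand-in, not the hash floret actually uses.
    """
    rows = []
    for seed in range(hash_count):
        digest = hashlib.blake2b(
            entry.encode("utf-8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % bucket)
    return rows


def toy_vector(word, table, hash_count=2, minn=3, maxn=6):
    """Sum the table rows selected by every n-gram of the word."""
    dim = len(table[0])
    total = [0.0] * dim
    for gram in char_ngrams(word, minn, maxn):
        for row in rows_for(gram, len(table), hash_count):
            for j in range(dim):
                total[j] += table[row][j]
    return total


# A tiny 16-row, 4-dimensional table with made-up values.
table = [[float(r + c) for c in range(4)] for r in range(16)]
vec = toy_vector("floret", table)
print(vec)
```

Because every string maps to some rows, a word never seen in training still receives a vector, and the table size is fixed by `bucket` rather than by the vocabulary.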
Install
- pip install floret
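After installing, it can help to confirm which version is active in the current environment. The sketch below uses only the standard library; `installed_version` and `version_tuple` are hypothetical helper names, and the 0.10.4 threshold is the one cited in the Common errors entry on `hashCount`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.10.4' into (0, 10, 4); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


found = installed_version("floret")
if found is None:
    print("floret is not installed in this environment")
elif version_tuple(found) < (0, 10, 4):
    print(f"floret {found} is installed; consider upgrading")
else:
    print(f"floret {found} is installed")
```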
Imports
- floret
import floret
Quickstart
import floret
import os
# Create a dummy data file for training
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("This is a sample sentence for floret training.\n")
    f.write("Floret is great for compact word vectors.\n")
    f.write("More sentences for training the model.\n")
# Train an unsupervised floret model
# IMPORTANT: Use mode="floret" to enable floret's Bloom embeddings.
# The default mode="fasttext" trains original fastText vectors.
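model = floret.train_unsupervised(
    "data.txt",
    model="cbow",
    mode="floret",
    hashCount=2,   # Recommended for floret mode
    bucket=50000,  # Reduced size hash table
    minn=3,
    maxn=6,
    dim=100,
    epoch=10,
    minCount=1,    # The sample corpus is tiny; the default minCount (5) would leave the vocabulary empty
)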
# Get a word vector
vector = model.get_word_vector("floret")
print(f"Vector for 'floret': {vector[:5]}...") # Print first 5 elements
# Save the full model (creates a .bin file)
model.save_model("vectors.bin")
print("Model saved to vectors.bin")
# Export the floret-specific vector table (creates a .floret file)
model.save_floret_vectors("vectors.floret")
print("Floret vectors saved to vectors.floret")
# Clean up dummy files
os.remove("data.txt")
os.remove("vectors.bin")
os.remove("vectors.floret")
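A common next step is to compare two vectors returned by `get_word_vector`. The following is a minimal, dependency-free sketch: `cosine_similarity` is a hypothetical helper, and the input vectors are stand-in values rather than output from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    return dot / (norm_a * norm_b)


# Stand-in vectors; with a trained model these would come from
# model.get_word_vector("floret") and model.get_word_vector("florets").
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]
print(cosine_similarity(v1, v2))
```

The function iterates over its inputs, so it should also accept the arrays that `get_word_vector` returns; run it before the cleanup step if the model is needed afterwards.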