Byte-Pair Embeddings (BPEmb)
BPEmb applies byte-pair encoding (BPE) to segment raw text into subword units and maps those units to pre-trained embeddings, covering 275 languages. It is designed for NLP tasks that need efficient subword tokenization and embedding. The current version is 0.3.6, with releases occurring sporadically as models or features are updated.
Common errors
- `urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>`
  Cause: A network error prevented the download of pre-trained models: no internet connection, firewall restrictions, or incorrect proxy settings.
  Fix: Verify your internet connection and check that your firewall allows Python to make outbound connections. If behind a proxy, configure proxy settings for your environment or for Python's requests.
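If a proxy is the culprit, one stdlib-only sketch is to set the standard proxy environment variables before initializing `BPEmb`; `urllib`, which handles the download, picks them up automatically. The proxy address below is a placeholder, not a real endpoint:

```python
import os
import urllib.request

# Hypothetical proxy address; replace with your organization's proxy.
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

# urllib reads these environment variables when building its proxy map.
proxies = urllib.request.getproxies()
print(proxies.get("https"))
```

Setting the variables in your shell profile instead of in code works the same way.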
- `FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.cache/bpemb/en_100000_100.model'`
  Cause: The cached BPEmb model files were deleted or moved, or the cache directory is inaccessible or corrupted, so the library cannot locate them.
  Fix: Ensure the `~/.cache/bpemb` directory (or your custom cache path) is intact and accessible. If files are missing, `BPEmb` re-downloads them automatically on initialization.
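A quick stdlib-only check of the default cache location can tell you whether the files are actually present (the path below is the default layout described above; a custom directory can be supplied via the constructor's `cache_dir` argument):

```python
from pathlib import Path

# bpemb's default cache directory; BPEmb(..., cache_dir=...) overrides it.
cache_dir = Path.home() / ".cache" / "bpemb"

if cache_dir.exists():
    # List whatever model/embedding files are currently cached.
    cached = sorted(p.name for p in cache_dir.rglob("*") if p.is_file())
    print(f"{len(cached)} cached file(s) in {cache_dir}")
else:
    print(f"No cache at {cache_dir}; BPEmb will re-download models on init")
```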
- `ValueError: Language 'xx' not supported. Available languages are: ['en', 'de', ...]`
  Cause: An unsupported or incorrect language code was passed to the `BPEmb` constructor.
  Fix: Consult the `bpemb` documentation or its source code for the list of supported language codes, and use one of the official options (e.g., 'en' for English, 'es' for Spanish).
- `MemoryError`
  Cause: Loading a very large model (high `dim` and `vs` parameters) or processing an extremely large amount of text at once exceeded the available system RAM.
  Fix: Reduce the `dim` (embedding dimension) and/or `vs` (vocabulary size) parameters when initializing `BPEmb`, and process text in smaller, manageable batches to spread the memory load.
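The batching advice can be sketched with a small, bpemb-independent helper that splits any iterable of sentences into fixed-size chunks, so only one batch needs to be held in memory at a time:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Example: 10 sentences in batches of 4.
sentences = [f"sentence {i}" for i in range(10)]
batches = list(batched(sentences, 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Inside the loop you would call `encode`/`embed` on each batch and discard the results (or write them to disk) before moving on.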
Warnings
- Gotcha: The `BPEmb` constructor triggers a large model download (hundreds of MB to GB) for each unique (language, dimension, vocabulary size) combination on first use. This can consume significant disk space and bandwidth.
- Gotcha: Loaded models can consume significant RAM (hundreds of MB or more) depending on the chosen `dim` and `vs` parameters, potentially leading to `MemoryError` on systems with limited resources.
- Gotcha: Out-of-vocabulary (OOV) words are handled differently by `encode` (which performs subword segmentation) and `embed` (which returns a zero vector for unknown words); don't expect unified behavior.
Install
```shell
pip install bpemb
```
Imports
- BPEmb
```python
from bpemb import BPEmb
```
Quickstart
```python
from bpemb import BPEmb

# Initialize BPEmb for English: 100-dim embeddings, 100,000-subword vocabulary.
# The model is downloaded on first run and cached under ~/.cache/bpemb.
bpemb_en = BPEmb(lang="en", dim=100, vs=100000)

# Encode a sentence into subword units
encoded_sentence = bpemb_en.encode("This is a test sentence for bpemb.")
print(f"Encoded sentence: {encoded_sentence}")

# Embed a word: embed() returns one vector per subword, shape (n_subwords, dim)
embedding = bpemb_en.embed("test")
print(f"Embedding shape for 'test': {embedding.shape}")

# Embed several words by calling embed() on each one
embeddings = [bpemb_en.embed(word) for word in ["this", "is", "bpemb"]]
print(f"Number of embedded words: {len(embeddings)}")
```