{"id":9559,"library":"bpemb","title":"Byte-Pair Embeddings (BPEmb)","description":"BPEmb provides byte-pair encodings (BPE) from raw text and maps subword units to pre-trained embeddings for 275 languages. It's designed for NLP tasks requiring efficient subword tokenization and embedding. The current version is 0.3.6, with releases occurring sporadically based on updates to models or features.","status":"active","version":"0.3.6","language":"en","source_language":"en","source_url":"https://github.com/bheinzerling/bpemb","tags":["embeddings","nlp","bpe","subword","language-models","vector-representations"],"install":[{"cmd":"pip install bpemb","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Numerical operations and array handling for embeddings.","package":"numpy"},{"reason":"Progress bars for downloads and processing.","package":"tqdm"}],"imports":[{"symbol":"BPEmb","correct":"from bpemb import BPEmb"}],"quickstart":{"code":"from bpemb import BPEmb\n\n# Initialize BPEmb for English, 100-dim embeddings, 100,000 vocabulary size\n# This will download the model the first time it's run.\nbpemb_en = BPEmb(lang=\"en\", dim=100, vs=100000)\n\n# Encode a sentence into subword units\nencoded_sentence = bpemb_en.encode(\"This is a test sentence for bpemb.\")\nprint(f\"Encoded sentence: {encoded_sentence}\")\n\n# Get embeddings for a single word\nembedding = bpemb_en.embed(\"test\")\nprint(f\"Embedding shape for 'test': {embedding.shape}\")\n\n# Get embeddings for a list of words\nembeddings_list = bpemb_en.embed_words([\"this\", \"is\", \"bpemb\"])\nprint(f\"Embeddings shape for word list: {embeddings_list.shape}\")","lang":"python","description":"Demonstrates how to initialize BPEmb, encode a sentence into subwords, and retrieve embeddings for individual words or lists of words. 
The first initialization for a specific (lang, dim, vs) combination will trigger a model download."},"warnings":[{"fix":"Ensure a stable internet connection and sufficient disk space. Models are cached locally for subsequent uses. Consider smaller `dim` or `vs` values if disk space is a concern.","message":"The BPEmb constructor triggers a model download for each unique (language, dimension, vocabulary_size) combination upon first use, ranging from tens of MBs to over a GB for the largest models. This can consume significant disk space and bandwidth.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If encountering `MemoryError` or performance issues, reduce the `dim` (embedding dimension) and/or `vs` (vocabulary size) parameters during `BPEmb` initialization. Process large texts in batches if possible.","message":"Loaded models can consume significant RAM (hundreds of MBs or more) depending on the chosen `dim` and `vs` parameters, potentially leading to `MemoryError` on systems with limited resources.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Prefer `encode` and `embed`, which apply BPE segmentation, over direct lookups in `bpemb.emb`; if you must look tokens up directly, check that the token exists in the subword vocabulary first.","message":"`embed` does not return zero vectors for unknown words: like `encode`, it first segments its input into subwords from the model's vocabulary, so it returns one vector per subword for (almost) any input. Direct lookups in the underlying gensim KeyedVectors (`bpemb.emb`), by contrast, fail for tokens that are not themselves subwords in the vocabulary.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Verify your internet connection. Check firewall settings to ensure Python can make outbound connections. If behind a proxy, configure proxy settings for your environment or Python requests.","cause":"Network error preventing the download of pre-trained models. This can be due to no internet connection, firewall issues, or incorrect proxy settings.","error":"urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>"},{"fix":"Ensure the `~/.cache/bpemb` directory (or your custom cache path) is intact and accessible. If files are missing, `BPEmb` will attempt to re-download them automatically upon initialization.","cause":"The cached model files for BPEmb were deleted, moved, or the cache directory is inaccessible/corrupted, and the library cannot locate them.","error":"FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.cache/bpemb/en/en.wiki.bpe.vs100000.model'"},{"fix":"Consult the `bpemb` documentation or its source code for the list of supported language codes. Ensure the language code matches one of the officially available options (e.g., 'en' for English, 'es' for Spanish).","cause":"An unsupported or incorrect language code was provided to the `BPEmb` constructor.","error":"ValueError: Language 'xx' not supported. Available languages are: ['en', 'de', ...]"},{"fix":"Reduce the `dim` (embedding dimension) and/or `vs` (vocabulary size) parameters when initializing `BPEmb`. Consider processing text in smaller, manageable batches to distribute memory load.","cause":"Attempting to load a very large model (high `dim` and `vs`) or process an extremely large amount of text at once, exceeding available system RAM.","error":"MemoryError"}]}