Byte-Pair Embeddings (BPEmb)

0.3.6 · active · verified Fri Apr 17

BPEmb provides pre-trained subword embeddings for 275 languages, built from byte-pair encoding (BPE) vocabularies trained on Wikipedia. It maps BPE subword units to embedding vectors, making it useful for NLP tasks that need efficient subword tokenization and embedding lookup without training from scratch. The current version is 0.3.6; releases occur sporadically as models or features are updated.

Common errors

Warnings

Install

Imports

Quickstart

Demonstrates how to initialize BPEmb, encode a sentence into subword units, and retrieve embeddings for those units. The first initialization for a specific (lang, dim, vs) combination triggers a download of the SentencePiece model and embedding files.

from bpemb import BPEmb

# Initialize BPEmb for English, 100-dim embeddings, 100,000 vocabulary size
# This will download the model the first time it's run.
bpemb_en = BPEmb(lang="en", dim=100, vs=100000)

# Encode a sentence into subword units
encoded_sentence = bpemb_en.encode("This is a test sentence for bpemb.")
print(f"Encoded sentence: {encoded_sentence}")

# Embed a string: the text is first split into subwords, and one
# embedding vector is returned per subword unit
embedding = bpemb_en.embed("test")
print(f"Embedding shape for 'test': {embedding.shape}")  # (n_subwords, 100)

# Map text to vocabulary ids, then look up vectors directly
ids = bpemb_en.encode_ids("this is bpemb")
vectors = bpemb_en.vectors[ids]
print(f"Embeddings shape for 'this is bpemb': {vectors.shape}")
