Wordfreq
Wordfreq is a Python library that provides high-quality estimates of word frequencies in over 40 languages, drawn from diverse sources such as books, web text, and social media. The library, currently at version 3.1.1, offers both 'small' and 'large' wordlists to trade off memory use against coverage. While packaging updates may continue, the underlying frequency data is a snapshot through approximately 2021 and is unlikely to be updated further, because generative AI output has 'polluted' online language data; the project's data is therefore effectively in maintenance mode.
Warnings
- breaking In version 3.0, the handling of multi-digit numbers changed. Previously, sequences of two or more digits were grouped into a single token (e.g., '1234' became '0000'), leading to an overestimated frequency. Now, frequencies are distributed across numbers of that shape, incorporating Benford's law and special handling for 4-digit years, providing more realistic estimates. Functions like `iter_wordlist` and `top_n_list` also no longer return multi-digit numbers.
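The digit-shape redistribution is internal to wordfreq, but its Benford's-law component is easy to sketch in plain Python. The `benford_leading_digit` helper below is illustrative only, not part of wordfreq's API:

```python
import math

def benford_leading_digit(d: int) -> float:
    """Probability that a number's leading digit is d (1-9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Under Benford's law, a leading 1 is far more common than a leading 9:
print(round(benford_leading_digit(1), 3))  # ≈ 0.301
print(round(benford_leading_digit(9), 3))  # ≈ 0.046

# The nine probabilities form a complete distribution:
print(round(sum(benford_leading_digit(d) for d in range(1, 10)), 6))
```

This is why, after the 3.0 change, a frequency estimate for '1234' is higher than one for '9234' of the same digit shape.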
- breaking Version 3.0 (building on changes from 2.0) significantly altered the tokenization functions. The `tokenize` function no longer supports the `combine_numbers` option; combining numbers is now part of the behavior of `lossy_tokenize`. Additionally, `tokenize` no longer automatically replaces Chinese characters with their Simplified Chinese equivalents; that transformation is also handled by `lossy_tokenize`.
- gotcha Wordfreq relies on the `regex` library for tokenization. Versions of `regex` prior to `2021.7.6` do not include the `regex.Match` class, which can lead to import errors or unexpected behavior. Ensure your `regex` installation is up-to-date.
- gotcha The word frequency data provided by `wordfreq` is based on language usage up to approximately 2021 and will not be updated further. This decision was made because generative AI models have 'polluted' online data sources, making it difficult to obtain reliable information about post-2021 human language usage.
- gotcha Support for Chinese, Japanese, and Korean (CJK) languages requires additional optional dependencies (`jieba`, `mecab-python3`, `ipadic`, `mecab-ko-dic`). For Japanese and Korean tokenization using `mecab-python3`, you may also need to install the `libmecab-dev` system package, which can be complex depending on your operating system.
Install
-
pip install wordfreq
-
pip install wordfreq[cjk]
Imports
- word_frequency
from wordfreq import word_frequency
- zipf_frequency
from wordfreq import zipf_frequency
- tokenize
from wordfreq import tokenize
Quickstart
from wordfreq import word_frequency, zipf_frequency
# Get the raw frequency (between 0 and 1)
freq_en = word_frequency('the', 'en')
print(f"Frequency of 'the' in English: {freq_en}")
freq_fr_cafe = word_frequency('café', 'fr')
print(f"Frequency of 'café' in French: {freq_fr_cafe}")
# Get the Zipf frequency (logarithmic scale, base-10 logarithm of occurrences per billion words)
zipf_en = zipf_frequency('computer', 'en')
print(f"Zipf frequency of 'computer' in English: {zipf_en}")
zipf_nonexistent = zipf_frequency('nonexistentword123', 'en')
print(f"Zipf frequency of 'nonexistentword123' in English: {zipf_nonexistent}")
# Example with a different wordlist (the default is 'best'; 'large' or 'small' can also be specified)
zipf_large = zipf_frequency('quantum', 'en', wordlist='large')
print(f"Zipf frequency of 'quantum' (large list) in English: {zipf_large}")
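The two scales above are directly convertible: a Zipf value is the base-10 logarithm of occurrences per billion words, so zipf = log10(frequency × 10⁹). A minimal sketch of the conversion (the helper names here are illustrative, not wordfreq functions):

```python
import math

def zipf_to_freq(zipf: float) -> float:
    """Convert a Zipf-scale value to a raw frequency (proportion of tokens)."""
    return 10 ** (zipf - 9)

def freq_to_zipf(freq: float) -> float:
    """Convert a raw frequency to the Zipf scale."""
    return math.log10(freq) + 9

# A Zipf value of 6 corresponds to about 1 occurrence per 1,000 words:
print(zipf_to_freq(6.0))    # ≈ 0.001
print(freq_to_zipf(0.001))  # ≈ 6.0
```

This makes it easy to interpret `zipf_frequency` results: common words like 'the' sit around Zipf 7, while rare words fall below Zipf 2.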