{"id":4853,"library":"wordfreq","title":"Wordfreq","description":"Wordfreq is a Python library providing high-quality estimates of word frequencies in over 40 languages, based on diverse data sources like books, web text, and social media. The library, currently at version 3.1.1, offers both 'small' and 'large' wordlists for different memory and coverage needs. While packaging updates may continue, the underlying word frequency data is a snapshot through approximately 2021 and is unlikely to be updated further because of concerns that generative AI has 'polluted' language usage data. The project's data is therefore effectively in maintenance mode.","status":"maintenance","version":"3.1.1","language":"en","source_language":"en","source_url":"https://github.com/rspeer/wordfreq/","tags":["natural language processing","nlp","word frequency","linguistics","text analysis","language data"],"install":[{"cmd":"pip install wordfreq","lang":"bash","label":"Standard installation"},{"cmd":"pip install wordfreq[cjk]","lang":"bash","label":"Installation with CJK language support"}],"dependencies":[{"reason":"Core dependency for data handling.","package":"msgpack","optional":false},{"reason":"Core dependency for language code handling.","package":"langcodes","optional":false},{"reason":"Core dependency for tokenization; requires version 2021.7.6 or newer for the `regex.Match` class.","package":"regex","optional":false},{"reason":"Optional dependency for Chinese language tokenization.","package":"jieba","optional":true},{"reason":"Optional dependency for Japanese and Korean language tokenization. 
Requires system package `libmecab-dev`.","package":"mecab-python3","optional":true},{"reason":"Optional dependency for Japanese language tokenization (used with `mecab-python3`).","package":"ipadic","optional":true},{"reason":"Optional dependency for Korean language tokenization (used with `mecab-python3`).","package":"mecab-ko-dic","optional":true}],"imports":[{"symbol":"word_frequency","correct":"from wordfreq import word_frequency"},{"symbol":"zipf_frequency","correct":"from wordfreq import zipf_frequency"},{"note":"The `tokenize` function is exported from the top-level `wordfreq` package. Since v2.0, `wordfreq.preprocess.preprocess_text` handles preprocessing on its own, and `lossy_tokenize` adds lossy postprocessing on top of tokenization.","wrong":"from wordfreq.preprocess import tokenize","symbol":"tokenize","correct":"from wordfreq import tokenize"}],"quickstart":{"code":"from wordfreq import word_frequency, zipf_frequency\n\n# Get the raw frequency (a proportion between 0 and 1)\nfreq_en = word_frequency('the', 'en')\nprint(f\"Frequency of 'the' in English: {freq_en}\")\n\nfreq_fr_cafe = word_frequency('café', 'fr')\nprint(f\"Frequency of 'café' in French: {freq_fr_cafe}\")\n\n# Get the Zipf frequency (logarithmic scale: base-10 logarithm of occurrences per billion words)\nzipf_en = zipf_frequency('computer', 'en')\nprint(f\"Zipf frequency of 'computer' in English: {zipf_en}\")\n\n# Words missing from the wordlist get a Zipf frequency of 0\nzipf_nonexistent = zipf_frequency('nonexistentword123', 'en')\nprint(f\"Zipf frequency of 'nonexistentword123' in English: {zipf_nonexistent}\")\n\n# Example with a different wordlist (the default is 'best'; 'large' or 'small' can also be specified)\nzipf_large = zipf_frequency('quantum', 'en', wordlist='large')\nprint(f\"Zipf frequency of 'quantum' (large list) in English: {zipf_large}\")","lang":"python","description":"This quickstart demonstrates how to use the `word_frequency` and `zipf_frequency` functions to retrieve word frequencies in different languages and scales. 
`word_frequency` returns a proportion between 0 and 1, while `zipf_frequency` returns a value on a human-friendly logarithmic scale."},"warnings":[{"fix":"Be aware that number-containing words may yield different frequencies compared to pre-3.0 versions. Review your logic if your application relies on specific numeric token representations.","message":"In version 3.0, the handling of multi-digit numbers changed. Previously, sequences of two or more digits were grouped into a single token (e.g., '1234' became '0000'), leading to an overestimated frequency. Now, frequencies are distributed across numbers of that shape, incorporating Benford's law and special handling for 4-digit years, providing more realistic estimates. Functions like `iter_wordlist` and `top_n_list` also no longer return multi-digit numbers.","severity":"breaking","affected_versions":">=3.0"},{"fix":"If you require specific preprocessing steps such as combining numbers or Chinese character simplification, explicitly use `wordfreq.lossy_tokenize(text, lang)` instead of `wordfreq.tokenize(text, lang)`.","message":"Version 2.0 separated tokenization from lossy postprocessing. The `tokenize` function no longer accepts a `combine_numbers` option; `lossy_tokenize` provides that behavior instead. Likewise, `tokenize` no longer replaces Chinese characters with their Simplified Chinese versions; `lossy_tokenize` now handles that transformation.","severity":"breaking","affected_versions":">=2.0"},{"fix":"Upgrade the `regex` library to version `2021.7.6` or newer: `pip install --upgrade regex`.","message":"Wordfreq relies on the `regex` library for tokenization. Versions of `regex` prior to `2021.7.6` do not include the `regex.Match` class, which can lead to import errors or unexpected behavior. 
Ensure your `regex` installation is up-to-date.","severity":"gotcha","affected_versions":"<2021.7.6 of `regex` dependency"},{"fix":"Be aware that `wordfreq` reflects historical language usage. For analyses requiring current, post-2021 language trends, `wordfreq`'s data may not be representative. Consider alternative methods or acknowledge this data limitation.","message":"The word frequency data provided by `wordfreq` is based on language usage up to approximately 2021 and will not be updated further. This decision was made because generative AI models have 'polluted' online data sources, making it difficult to obtain reliable information about post-2021 human language usage.","severity":"gotcha","affected_versions":"All versions, regarding data content"},{"fix":"Install CJK dependencies using `pip install wordfreq[cjk]`. For MeCab (Japanese/Korean), consult `mecab-python3` documentation for system-level prerequisites like `libmecab-dev`.","message":"Support for Chinese, Japanese, and Korean (CJK) languages requires additional optional dependencies (`jieba`, `mecab-python3`, `ipadic`, `mecab-ko-dic`). For Japanese and Korean tokenization using `mecab-python3`, you may also need to install the `libmecab-dev` system package, which can be complex depending on your operating system.","severity":"gotcha","affected_versions":"All versions, for CJK language support"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}