URLExtract
URLExtract is a Python library that finds and extracts URLs from text by locating Top-Level Domains (TLDs). The current version is 1.9.0. The library is actively maintained, with regular updates to its TLD list and continued compatibility with new Python versions.
Warnings
- gotcha TLD-based detection can produce false matches in some contexts, such as CSS selectors (e.g., `p.bold.name` may be extracted because `.name` is a valid TLD). Such strings are valid matches for the library's pattern, but they may not be URLs you intended to extract.
- gotcha Users have reported `urlextract.cachefile.CacheFileError` or issues with custom cache directories not saving TLDs, especially in bundled applications (like PyInstaller) or read-only file systems.
- breaking Support for Python 3.6 has been dropped in recent versions due to underlying dependency changes (e.g., `filelock`). Users on Python 3.6 will encounter errors.
- gotcha Older versions (prior to 1.9.0) might incorrectly parse URLs within Markdown links or have issues with filtering mixed-case hostnames, leading to incomplete or incorrect extractions.
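One way to reduce the false matches described in the first warning is to post-filter the extracted candidates. The sketch below is only an illustration, not part of URLExtract: `keep_explicit_urls` and `ALLOWED_SCHEMES` are hypothetical names, and the filter uses only the standard library's `urllib.parse`. It keeps candidates that carry an explicit `http` or `https` scheme and a host, on the assumption that bare matches such as `p.bold.name` are unwanted in your input.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; adjust to the schemes you expect in your text.
ALLOWED_SCHEMES = {"http", "https"}


def keep_explicit_urls(candidates):
    """Keep only candidates that have an allowed scheme and a non-empty host."""
    kept = []
    for candidate in candidates:
        parsed = urlparse(candidate)
        if parsed.scheme in ALLOWED_SCHEMES and parsed.netloc:
            kept.append(candidate)
    return kept


# Stand-in for the list that extractor.find_urls(text) would return.
candidates = [
    "p.bold.name",
    "https://www.another-example.org/path?query=1",
    "example.com",
]
print(keep_explicit_urls(candidates))
# ['https://www.another-example.org/path?query=1']
```

The trade-off is that legitimate bare domains such as `example.com` are dropped as well, so this filter only suits inputs where real links always include a scheme.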
Install
pip install urlextract
Imports
- URLExtract
from urlextract import URLExtract
Quickstart
from urlextract import URLExtract

extractor = URLExtract()
text = "Check out our website: example.com or find us at https://www.another-example.org/path?query=1"
urls = extractor.find_urls(text)
print(urls)
# Expected output: ['example.com', 'https://www.another-example.org/path?query=1']
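For the cache-related warning (bundled applications or read-only file systems), one possible workaround is to point the extractor at a directory known to be writable. The following is a minimal sketch under stated assumptions: `writable_cache_dir` and `build_extractor` are hypothetical helpers, and the sketch assumes that the `URLExtract` constructor accepts a `cache_dir` keyword argument. Verify this against your installed version before relying on it.

```python
import os
import tempfile


def writable_cache_dir(preferred=None):
    """Return `preferred` if it is an existing writable directory,
    otherwise create and return a fresh temporary directory."""
    if preferred and os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return tempfile.mkdtemp(prefix="urlextract-cache-")


def build_extractor(preferred=None):
    """Create an extractor whose TLD cache lives in a writable directory."""
    # Imported lazily so the helper above can be used without urlextract.
    from urlextract import URLExtract

    # Assumption: `cache_dir` is a supported constructor keyword.
    return URLExtract(cache_dir=writable_cache_dir(preferred))
```

Calling `build_extractor()` and then `find_urls(text)` should behave like the quickstart. A freshly created directory holds no cached TLD list, so a refresh (for example via the extractor's `update()` method) is probably needed before anything is saved there.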