Goose3
3.1.21 verified Fri May 01 auth: no python
Goose3 is an HTML content/article extractor and web scraper for Python 3 (requires Python >=3.9). It extracts the main content, title, authors, metadata (OpenGraph, schema.org), and images from news articles and web pages. The current version is 3.1.21, with irregular releases as fixes accumulate.
pip install goose3
Common errors
error ModuleNotFoundError: No module named 'goose'
cause The code imports from the old 'goose' package (Python 2) instead of 'goose3'.
fix Install the package with `pip install goose3` and import with `from goose3 import Goose`.
error requests.exceptions.MissingSchema: Invalid URL 'example.com/article': No schema supplied. Perhaps you meant http://example.com/article?
cause The URL passed to extract() is missing its scheme (http:// or https://).
fix Prepend 'https://' to the URL before calling extract().
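The fix above can be sketched as a small standard-library helper. The name `ensure_scheme` and its `default_scheme` parameter are hypothetical and not part of Goose3; this is one possible way to normalize a URL before handing it to `extract()`.

```python
from urllib.parse import urlparse


def ensure_scheme(url: str, default_scheme: str = "https") -> str:
    """Return url unchanged if it already starts with http(s), else prepend a scheme."""
    url = url.strip()
    if urlparse(url).scheme in ("http", "https"):
        return url
    # Drop leading slashes so protocol-relative input like '//example.com' also works.
    return f"{default_scheme}://{url.lstrip('/')}"
```

Calling `ensure_scheme(user_input)` before `extract(url=...)` should avoid the MissingSchema error for bare hostnames and paths.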
error TypeError: 'NoneType' object is not iterable
cause Can occur when iterating over article.tags or article.authors after a failed extraction (e.g., a network error or a non-article page).
fix Check that article is not None and that the field you iterate over is populated before using it.
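A minimal sketch of that check, assuming (as the cause above describes) that a failed extraction can leave the article or one of its list fields as None. The helper name `safe_list` is hypothetical and not part of Goose3.

```python
def safe_list(article, field):
    """Return article.<field> as a list, or [] if the article or the field is missing."""
    if article is None:
        return []
    value = getattr(article, field, None)
    # None and empty containers both collapse to an empty list.
    return list(value) if value else []
```

With this in place, `for author in safe_list(article, "authors"):` iterates zero times instead of raising a TypeError.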
Warnings
deprecated camelCase methods (e.g., `getTags()`, `getTones()`) have been deprecated since v3.1.13. Use the snake_case equivalents (`tags`, `tones`).
fix Replace `article.getTags()` with `article.tags` and `article.getTones()` with `article.tones`.
gotcha Goose3 does not execute JavaScript, so only content present in the static HTML is extracted.
fix Use a headless browser such as Selenium or Playwright to obtain the rendered HTML, then pass that HTML to Goose3.
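One way this could look with Playwright's synchronous API is sketched below. It assumes Playwright and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`) and that `extract()` accepts a `raw_html` argument. The function names `render_html` and `extract_rendered` are hypothetical.

```python
def render_html(url: str) -> str:
    """Load url in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Wait for network activity to settle before reading the page.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def extract_rendered(url: str, fetch=render_html):
    """Fetch rendered HTML with `fetch`, then let Goose3 parse it."""
    from goose3 import Goose

    html = fetch(url)
    with Goose() as g:
        # raw_html skips Goose3's own download step (assumed parameter name).
        return g.extract(raw_html=html)
```

The `fetch` parameter keeps the rendering step swappable, so a Selenium-based function that returns an HTML string could be passed instead.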
breaking Support for Python 3.7 and 3.8 was removed in v3.1.20; Python >=3.9 is now required.
fix Upgrade to Python 3.9 or higher.
gotcha The `extract()` method can raise `requests.exceptions.MissingSchema` if the URL does not include a scheme (e.g., 'example.com' instead of 'https://example.com').
fix Always pass a full URL that starts with http:// or https://.
Imports
- Goose
wrong: from goose import Goose
correct: from goose3 import Goose
- Article
from goose3.article import Article
Quickstart
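from goose3 import Goose

url = 'https://www.bbc.com/news/world-us-canada-68942345'

# Using Goose as a context manager releases its resources on exit.
with Goose() as g:
    article = g.extract(url=url)

# Print the headline and the first 200 characters of the extracted body text.
print(article.title)
print(article.cleaned_text[:200])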
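Beyond the title and body text, the description at the top mentions authors and images. The sketch below gathers a few such fields into a plain dict. The attribute names `authors` and `top_image` (with a `src` attribute) are assumptions about the article object, so each is read through `getattr` with a fallback, and the helper name `summarize` is hypothetical.

```python
def summarize(article, preview_chars: int = 200) -> dict:
    """Collect a few commonly used article fields, tolerating missing ones."""
    top_image = getattr(article, "top_image", None)
    text = getattr(article, "cleaned_text", "") or ""
    return {
        "title": getattr(article, "title", None),
        "authors": list(getattr(article, "authors", None) or []),
        # top_image is assumed to expose the image URL as .src.
        "top_image": getattr(top_image, "src", None),
        "preview": text[:preview_chars],
    }
```

After the quickstart has run, `summarize(article)` returns a dict that is easy to log or serialize.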