Beautiful Soup 4
HTML and XML parsing library. Current version is 4.14.3. Install name is beautifulsoup4 (pip install beautifulsoup4), import name is bs4 (from bs4 import BeautifulSoup). Always specify a parser explicitly — omitting it causes a UserWarning and inconsistent cross-platform behavior.
Warnings
- breaking Install name (beautifulsoup4) and import name (bs4) differ. pip install beautifulsoup4 then from bs4 import BeautifulSoup. Attempting pip install bs4 installs an old, abandoned package that is NOT Beautiful Soup 4.
- breaking The text= parameter to find() and find_all() was renamed to string= in 4.9.0. Passing text= now raises a DeprecationWarning, and support will be removed in a future release.
- gotcha Omitting the parser argument raises UserWarning and picks a parser automatically based on what's installed — which may differ between dev and production environments. Code may parse HTML differently on different machines.
- gotcha The BeautifulSoup package (without 4) on PyPI is the old, dead Beautiful Soup 3. pip install beautifulsoup installs the wrong package. pip install bs4 installs a different abandoned package.
- gotcha soup.find('tag') returns None if not found, not an empty list. Calling .get_text() or ['href'] on None raises AttributeError.
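The last two gotchas can be sketched in a few lines; the HTML fragment here is illustrative:

```python
from bs4 import BeautifulSoup

html = '<html><body><h1>Welcome</h1><a href="/home">Home</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None on no match; guard before chaining calls
missing = soup.find('h2')
subtitle = missing.get_text() if missing is not None else ''

# Match on string content with string=, not the deprecated text=
heading = soup.find('h1', string='Welcome')
print(heading.get_text())  # Welcome
```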
Install
- pip install beautifulsoup4
- pip install beautifulsoup4 lxml
- pip install beautifulsoup4 html5lib
Imports
- BeautifulSoup
from bs4 import BeautifulSoup

# Always specify parser explicitly
soup = BeautifulSoup(html_content, 'html.parser')  # built-in, no extra install
soup = BeautifulSoup(html_content, 'lxml')         # requires: pip install lxml
soup = BeautifulSoup(html_content, 'html5lib')     # requires: pip install html5lib
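Parsers also disagree on how to repair invalid markup, which is why pinning one matters. A minimal illustration with the built-in parser (lxml and html5lib would each fix up this fragment differently, adding html/body wrappers):

```python
from bs4 import BeautifulSoup

# The built-in parser drops the stray </p> and keeps the bare fragment;
# lxml would wrap it in <html><body>, and html5lib would additionally
# add <head> and turn the stray </p> into an empty <p></p>.
soup = BeautifulSoup('<a></p>', 'html.parser')
print(str(soup))  # <a></a>
```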
Quickstart
from bs4 import BeautifulSoup
import requests
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
# Find elements
title = soup.find('title').get_text() # first match
links = soup.find_all('a', href=True) # all <a> with href
divs = soup.select('div.content > p') # CSS selectors via soupsieve
# Navigate
body = soup.body
first_p = soup.body.p
parent = first_p.parent
# Text extraction
text = soup.get_text(separator=' ', strip=True)
# Find with attributes
button = soup.find('button', {'class': 'submit', 'type': 'submit'})
# Find by string content (NOT text= — that's deprecated)
heading = soup.find('h1', string='Welcome')
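One pattern worth adding to the quickstart: attribute access on a found tag is dict-style, and tag.get() returns None instead of raising KeyError when the attribute is absent. The fragment below is illustrative:

```python
from bs4 import BeautifulSoup

html = '<a class="nav active" href="/about">About</a><a class="nav">Bare</a>'
soup = BeautifulSoup(html, 'html.parser')

first, second = soup.find_all('a')

print(first['href'])       # dict-style access; raises KeyError if absent
print(second.get('href'))  # returns None instead of raising KeyError
print(first['class'])      # multi-valued attributes come back as a list
```

Note that class is treated as multi-valued, so first['class'] is ['nav', 'active'], not a single string.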