mechanize
mechanize provides a stateful, programmatic web-browsing interface: it can open URLs, follow links, submit forms, and handle cookies, simulating a web browser without a GUI or JavaScript engine. The current version is 0.4.10, released in 2023; the project follows a slow release cadence, shipping primarily maintenance and bug fixes.
Common errors
- http.client.BadStatusLine: ''
  cause: The server responded with an invalid or empty HTTP status line, often because the response was malformed or the server closed the connection unexpectedly.
  fix: Check the target URL in a regular browser and make sure your User-Agent looks realistic; the server may be blocking suspicious headers or rate-limiting you. Increase timeouts or retry the request.
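The retry advice above can be wrapped in a small helper. This is a hedged sketch: the helper name, retry count, and backoff schedule are arbitrary choices for illustration, not part of mechanize's API.

```python
import time
from http.client import BadStatusLine


def open_with_retry(open_fn, url, retries=3, delay=1.0):
    """Call open_fn(url), retrying on flaky-connection errors.

    open_fn is typically br.open for a mechanize.Browser instance.
    """
    last_exc = None
    for attempt in range(retries):
        try:
            return open_fn(url)
        except (BadStatusLine, ConnectionError, OSError) as exc:
            last_exc = exc
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise last_exc
```

Pass `br.open` as `open_fn` to retry a mechanize request: `open_with_retry(br.open, url)`.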
- AttributeError: 'NoneType' object has no attribute 'get_header' or AttributeError: 'NoneType' object has no attribute 'headers'
  cause: `br.open()` failed to return a valid response object (e.g., a network error, DNS failure, or an immediate connection reset), and subsequent code tried to access headers or other attributes on a `None` object.
  fix: Wrap `br.open()` calls in a try-except block that catches `mechanize.URLError` and `mechanize.HTTPError`. Verify the URL and network connectivity, and inspect `br.response()` if one is available.
- mechanize.FormNotFoundError: no form matching name ... or nr ...
  cause: You are selecting a form that does not exist on the current page, or your criteria (name, id, index `nr`) match none of the available forms.
  fix: Inspect the page HTML (`response.read()`) to find the correct form attributes (name, id) or its numerical index. You can iterate `for form in br.forms(): print(form)` to list every available form.
- mechanize._urllib2_fork.HTTPError: HTTP Error 403: Forbidden
  cause: The server denied access to the resource, e.g. because `robots.txt` disallows it, the User-Agent is missing or unrealistic, your IP is blocked, or another security measure fired.
  fix: Call `br.set_handle_robots(False)` if you intend to ignore `robots.txt`, and set a realistic User-Agent string. If the error persists, consider rotating IP addresses or waiting before retrying.
Warnings
- gotcha mechanize does NOT execute JavaScript. It's a 'headless' browser in the sense it has no GUI, but it cannot render or interact with dynamic content generated by JavaScript. If a page relies on JavaScript for content loading, form submission, or navigation, mechanize will not see or interact with it.
- breaking Major breaking changes occurred during the transition from Python 2 to Python 3. Code written for mechanize on Python 2.x is likely incompatible with Python 3.x because of changes to the internal module layout and to string/bytes handling.
- gotcha By default, mechanize respects `robots.txt` rules. Many scraping tasks require bypassing this, which can lead to `HTTP Error 403: Forbidden` or simply not accessing desired content.
- gotcha Many modern websites require specific User-Agent headers to display content correctly or to prevent blocking. Without setting a realistic User-Agent, you might receive errors or be served different content.
Install
-
pip install mechanize
Imports
- Browser
from mechanize import Browser
import mechanize
br = mechanize.Browser()
Quickstart
import mechanize
import http.cookiejar as cookielib
br = mechanize.Browser()
# Cookie Jar setup (optional but recommended for stateful browsing)
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True) # note: mechanize warns that gzip handling is experimental
br.set_handle_redirect(True)
br.set_handle_robots(False) # Often set to False for scraping
# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open a page
url = "http://www.example.com/"
# If the target comes from an environment variable (e.g. for authenticated runs):
# import os; url = os.environ.get('MECHANIZED_TARGET_URL', 'http://www.example.com/')
try:
    response = br.open(url)
    print(f"Title: {br.title()}")
    print(f"Status: {response.code}")
    # print(response.read().decode('utf-8'))
except Exception as e:
    print(f"An error occurred: {e}")