MWParserFromHell
MWParserFromHell is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode. It supports Python 3.9+ and is actively developed: releases, typically a few per year, add support for new Python versions and address parsing nuances.
Warnings
- breaking Support for end-of-life Python versions is dropped regularly. For example, v0.7.0 dropped Python 3.8, v0.6.6 dropped 3.7, and v0.6.5 dropped 3.6. Keep your Python environment on a supported version (currently Python 3.9+).
- breaking In v0.6.0, `Wikicode.matches()` was updated to recognize underscores as equivalent to spaces, and `Template.get()` gained a `default` parameter. Separately, `Wikicode`'s `filter()` methods default to `recursive=True` (the default was `False` in early 0.x releases); pass `recursive=False` to get only top-level nodes.
- gotcha When installing from source, mwparserfromhell attempts to build a fast C tokenizer extension. If this fails (e.g., due to missing C compilers), it falls back to a slower pure-Python implementation. You can explicitly control this by setting the environment variable `WITH_EXTENSION=0` during installation to force the pure-Python version.
- gotcha mwparserfromhell operates on the raw wikicode. It cannot detect syntax elements produced by template transclusion (i.e., it doesn't expand templates) or resolve complex, cross-over syntax (e.g., `{{echo|''Hello}}, world!''`). For such cases, the parser may treat portions as plain text. The `skip_style_tags=True` parameter in `parse()` can sometimes help with formatting-related issues.
- gotcha While `template['param_name']` (dict-style access) works for `Template` objects, it will raise a `ValueError` if the parameter does not exist. Using `template.get('param_name', default_value)` is generally safer and clearer for handling potentially missing parameters, similar to Python's dictionary `get` method.
- gotcha The nested node depth limit was raised from 40 to 100 in v0.6.6 to better match MediaWiki's parsing behavior. Extremely deeply nested wikicode structures might still hit this limit, potentially leading to incomplete parsing or errors.
Install
pip install mwparserfromhell
Imports
- parse
import mwparserfromhell
wikicode = mwparserfromhell.parse(text)
Quickstart
import mwparserfromhell
text = """I has a template! {{foo|bar|baz|eggs=spam}} \n== Heading ==\n[[File:Example.jpg|thumb|A caption.]] See it?"""
wikicode = mwparserfromhell.parse(text)
print(wikicode) # Outputs the original wikicode
# Filter for templates
templates = wikicode.filter_templates()
if templates:
    template = templates[0]
    print(f"Template name: {template.name}")
    print(f"Template parameter '1': {template.get(1).value}")
    print(f"Template parameter 'eggs': {template.get('eggs').value}")
# Filter for wikilinks (file embeds such as [[File:...]] are parsed as wikilinks)
wikilinks = wikicode.filter_wikilinks()
if wikilinks:
    print(f"First wikilink: {wikilinks[0].title}")
# Get all headings
headings = wikicode.filter_headings()
if headings:
    print(f"First heading: {headings[0].title}")