mwxml
0.3.8 verified Fri May 01 auth: no python
A set of utilities for processing MediaWiki XML dump data. Currently at version 0.3.8, with irregular releases.
pip install mwxml
Common errors
error AttributeError: module 'mwxml' has no attribute 'Dump'
cause The package is not installed correctly, or the installed version does not match the one expected.
fix
Ensure mwxml is installed (pip install mwxml) and import with import mwxml. Then use mwxml.Dump.
error OSError: [Errno 22] Invalid argument
cause Opening the dump file in text mode instead of binary mode.
fix
Open the file with open('dump.xml', 'rb') (binary mode).
error TypeError: cannot use a string pattern on a bytes-like object
cause Passing a file opened in text mode to Dump.from_file.
fix
Use binary mode: open('dump.xml', 'rb').
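When the AttributeError above appears, it can help to check which file Python actually loads under the name mwxml and which version is installed. The sketch below uses only the standard library. The helper names locate_module and installed_version are illustrative and not part of mwxml. One possibility worth ruling out, stated here as an assumption rather than a confirmed cause, is a local file named mwxml.py shadowing the installed package.

```python
import importlib.metadata
import importlib.util


def locate_module(name):
    """Return the file path a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def installed_version(dist_name):
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


# A path inside your own project (rather than site-packages) suggests shadowing.
print("mwxml loaded from:", locate_module("mwxml"))
print("mwxml version:", installed_version("mwxml"))
```

If the first line prints None, the package is not importable from the current environment. If it prints a path outside site-packages, renaming the local file is a likely remedy.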
Warnings
gotcha Dump.from_file expects a file opened in binary mode ('rb'), not text mode.
fix Always open the dump file with open('file.xml', 'rb').
gotcha Revision.text may be None if the revision has been deleted or suppressed. Always check for None before processing.
fix Use: if revision.text is not None: process(revision.text)
deprecated The mwxml.Dump constructor is deprecated in favor of Dump.from_file.
fix Use mwxml.Dump.from_file() instead of mwxml.Dump().
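The None check from the warnings above can be wrapped in a small generator. This is a minimal sketch: iter_texts is an illustrative helper, not an mwxml function, and the sample objects are hypothetical stand-ins with a text attribute so that the block runs without a dump file.

```python
from types import SimpleNamespace


def iter_texts(revisions):
    """Yield the text of each revision, skipping revisions whose text is None."""
    for revision in revisions:
        # Compare against None explicitly so that empty strings are kept.
        if revision.text is not None:
            yield revision.text


# Hypothetical stand-ins for revisions; real ones would come from a dump.
sample = [
    SimpleNamespace(text="First draft"),
    SimpleNamespace(text=None),  # e.g. a deleted or suppressed revision
    SimpleNamespace(text=""),    # an empty but present revision
]

texts = list(iter_texts(sample))
print(texts)
```

Testing with "is not None" rather than plain truthiness matters here: an empty revision is still a valid revision, while None signals that the text is unavailable.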
Imports
- Dump
  import mwxml
- Page
  from mwxml import Page
Quickstart
import mwxml

# Open the dump in binary mode, as Dump.from_file expects
dump = mwxml.Dump.from_file(open('example.xml', 'rb'))

for page in dump.pages:
    print(page.title)
    # Iterating over a page yields its revisions
    for revision in page:
        print(revision.text)
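Building on the quickstart, a per-page summary can be collected instead of printing every revision. The sketch below assumes that each page has a title attribute and can be iterated to yield its revisions. summarize_pages and FakePage are hypothetical names introduced for illustration, and the stand-in pages let the block run without mwxml or a dump file.

```python
from types import SimpleNamespace


def summarize_pages(pages):
    """Map each page title to (revision count, total characters of revision text)."""
    summary = {}
    for page in pages:
        count = 0
        chars = 0
        for revision in page:
            count += 1
            # Deleted or suppressed revisions may have no text.
            if revision.text is not None:
                chars += len(revision.text)
        summary[page.title] = (count, chars)
    return summary


class FakePage:
    """Hypothetical stand-in for a page: a title plus an iterator of revisions."""

    def __init__(self, title, texts):
        self.title = title
        self._texts = texts

    def __iter__(self):
        return (SimpleNamespace(text=t) for t in self._texts)


pages = [FakePage("Foo", ["a", "abc", None]), FakePage("Bar", [])]
summary = summarize_pages(pages)
print(summary)
```

With a real dump, the same function would presumably be called on the object returned by mwxml.Dump.from_file. Because dumps are read as a stream, the pages can be consumed only once per open file.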