mwxml

0.3.8 verified Fri May 01 auth: no python

A set of utilities for processing MediaWiki XML dump data. Currently at version 0.3.8, with irregular releases.

pip install mwxml
error AttributeError: module 'mwxml' has no attribute 'Dump'
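cause The package is not installed correctly, the installed version differs from the one the code expects, or a local file named mwxml.py shadows the installed package.
fix
Confirm the installation with pip install mwxml, check that no local mwxml.py shadows the package, then import mwxml and use mwxml.Dump.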
error OSError: [Errno 22] Invalid argument
cause Opening the dump file in text mode instead of binary mode.
fix
Open the file with open('dump.xml', 'rb') (binary mode).
error TypeError: cannot use a string pattern on a bytes-like object
cause Passing a file opened in text mode to Dump.from_file.
fix
Use binary mode: open('dump.xml', 'rb').
gotcha Dump.from_file expects a file opened in binary mode ('rb'), not text mode.
fix Always open the dump file with open('file.xml', 'rb').
gotcha Revision.text may be None if the revision has been deleted or suppressed. Always check for None before processing.
fix Use: if revision.text is not None: process(revision.text)
deprecated The mwxml.Dump constructor is deprecated in favor of Dump.from_file.
fix Use mwxml.Dump.from_file() instead of mwxml.Dump().
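
The binary-mode gotcha above can be caught before the file handle reaches the parser. A minimal sketch, assuming a hypothetical helper named require_binary (it is not part of mwxml) that rejects text-mode handles with a clear message:

```python
import io


def require_binary(handle):
    """Return handle unchanged if it is a binary stream, else raise ValueError.

    Hypothetical helper for illustration; not part of mwxml.
    """
    # Files opened with 'r' (and io.StringIO) are instances of io.TextIOBase.
    if isinstance(handle, io.TextIOBase):
        raise ValueError("dump file must be opened in binary mode ('rb')")
    return handle
```

It would typically wrap the open call, for example mwxml.Dump.from_file(require_binary(open('dump.xml', 'rb'))).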

Opens a MediaWiki XML dump file and iterates over pages and revisions.

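import mwxml

# Open in binary mode; the with block closes the file when iteration ends.
with open('example.xml', 'rb') as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump.pages:
        print(page.title)
        # Iterating a page yields its revisions.
        for revision in page:
            # text is None for deleted or suppressed revisions.
            if revision.text is not None:
                print(revision.text)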
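
The None check from the gotcha on Revision.text can be factored into a small generator. This is a sketch under stated assumptions: iter_texts is a hypothetical name, it relies only on each revision exposing id and text attributes, and the sample objects are stand-ins built with SimpleNamespace rather than real revisions from a dump.

```python
from types import SimpleNamespace


def iter_texts(revisions):
    """Yield (revision id, text) pairs, skipping revisions whose text is None."""
    for revision in revisions:
        # Compare against None explicitly so empty-string text is still kept.
        if revision.text is not None:
            yield revision.id, revision.text


# Stand-in objects for illustration only; real revisions come from a dump.
sample = [
    SimpleNamespace(id=1, text="first"),
    SimpleNamespace(id=2, text=None),  # deleted or suppressed
    SimpleNamespace(id=3, text=""),    # empty but present
]
result = list(iter_texts(sample))
```

Testing with "is not None" rather than plain truthiness keeps revisions whose text is an empty string, which a bare "if revision.text:" would drop.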