mrjob - Python MapReduce Framework

0.7.4 verified Fri May 01 auth: no python maintenance

mrjob is a Python library that allows you to write MapReduce jobs and run them on Hadoop, Amazon EMR, or your local machine. Version 0.7.4 is the latest release; development appears slow with no recent releases.

pip install mrjob
error ModuleNotFoundError: No module named 'yaml'
cause mrjob reads its configuration files with PyYAML, and PyYAML is not present in the active environment (for example, when mrjob was installed without its dependencies).
fix
Run: pip install pyyaml
error ImportError: No module named mrjob.job
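cause Python cannot find `mrjob.job` in the active environment, typically because mrjob is not installed there or because a local file named mrjob.py shadows the package. Note that `MRJob` lives in the `mrjob.job` module; `from mrjob import MRJob` does not work.
fix
Install mrjob in the environment that runs the script, rename any local mrjob.py, and use: from mrjob.job import MRJob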
error mrjob.errors.MrJobError: No runners found (tried ...)
cause mrjob could not find a suitable runner. This often happens when no runner is specified or when the dependencies for the chosen runner are missing.
fix
Specify the runner with the -r flag, e.g. python job.py -r hadoop input.txt. To run locally, use -r local.
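
The first two errors above come down to a module not being importable in the interpreter that runs the job. A minimal diagnostic sketch, using only the standard library, can confirm this before a job is launched. The helper names `is_importable` and `missing_modules` are hypothetical and are not part of mrjob.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate module_name in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError for a dotted name such as
        # "mrjob.job" when the parent package itself is missing.
        return False


def missing_modules(module_names):
    """Return the subset of module_names that cannot be located."""
    return [name for name in module_names if not is_importable(name)]


if __name__ == "__main__":
    # Modules named in the error entries above.
    missing = missing_modules(["yaml", "mrjob.job"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All checked modules are importable.")
```

Run it with the same interpreter that runs the job (for example the same virtualenv), since the result depends on that environment.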
gotcha mrjob 0.7.4 requires `pyyaml` for configuration files (`.mrjob.conf`). Missing it will cause import errors.
fix Install pyyaml: `pip install pyyaml`
deprecated Python 2 has reached end of life. Run mrjob 0.7.x under Python 3.
fix Use Python 3.6+.
gotcha Running on Amazon EMR requires `boto3` and proper AWS credentials. Jobs may fail silently if credentials are not configured.
fix Install boto3: `pip install boto3`. Set environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
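
Because a missing credential may only surface after the job has been submitted, it can help to fail fast before launching. The following is a minimal sketch that checks only the two environment variables named above. The helper name `missing_aws_env` is hypothetical, and the check is deliberately narrow: boto3 can also pick up credentials from other places (such as a shared credentials file), so an empty result here is sufficient but not necessary.

```python
import os

# The two variables named in the fix above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(environ=None):
    """Return the required AWS variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        print("Missing AWS environment variables:", ", ".join(missing))
    else:
        print("AWS credential environment variables are set.")
```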

Word count MapReduce job using mrjob. Run locally with: python word_count.py input.txt

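from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes, e.g. "don't".
WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # The input key is ignored; emit (word, 1) for every word in the line.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        # Pre-sum counts close to the mapper to reduce data sent to reducers.
        yield (word, sum(counts))

    def reducer(self, word, counts):
        # Sum all partial counts for a word into its final total.
        yield (word, sum(counts))

if __name__ == '__main__':
    # Parse command-line arguments (input files, -r runner) and run the job.
    MRWordFreqCount.run()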
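
To see what the job above computes without installing mrjob or a cluster, the same map, group-by-key, and reduce steps can be simulated in plain Python. This is an illustrative sketch of the data flow only, not how mrjob executes jobs. The function names (`map_line`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the combiner step is omitted because it does not change the result.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def map_line(line):
    """Map step: emit (word, 1) for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def shuffle(pairs):
    """Group step: collect all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, group, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["the cat", "The hat"]))
    # {'cat': 1, 'hat': 1, 'the': 2}
```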