mrjob - Python MapReduce Framework
0.7.4 verified Fri May 01 auth: no python maintenance
mrjob is a Python library that allows you to write MapReduce jobs and run them on Hadoop, Amazon EMR, or your local machine. Version 0.7.4 is the latest release; development appears slow with no recent releases.
pip install mrjob
Common errors
error ModuleNotFoundError: No module named 'yaml'
cause mrjob uses PyYAML for configuration but does not declare it as a dependency in some versions.
fix Run: `pip install pyyaml`
error ImportError: No module named mrjob.job
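cause `MRJob` lives in the `mrjob.job` submodule, so `from mrjob import MRJob` does not work. If `mrjob.job` itself cannot be found, mrjob is usually not installed in the active interpreter, or a local file named `mrjob.py` is shadowing the package.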
fix Use: `from mrjob.job import MRJob`
error mrjob.errors.MrJobError: No runners found (tried ...)
cause mrjob could not find a suitable runner. This often happens when no runner is specified or when dependencies for the chosen runner are missing.
fix Specify a runner with the `-r` flag, e.g. `python job.py -r hadoop input.txt`. For local runs, use `-r local`.
Warnings
gotcha mrjob 0.7.4 requires `pyyaml` for configuration files (`.mrjob.conf`). Missing it will cause import errors.
fix Install pyyaml: `pip install pyyaml`
deprecated Python 2 is not supported by mrjob 0.7.x; Python 3 is required.
fix Use Python 3.6+.
gotcha Running on Amazon EMR requires `boto3` and proper AWS credentials. Jobs may fail silently if credentials are not configured.
fix Install boto3: `pip install boto3`. Set environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
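The gotchas above can be checked before a job is launched. The following is a minimal sketch, not part of mrjob: the helper names `missing_modules` and `missing_env_vars` are hypothetical, and the check only looks at the two environment variables named above. Credentials supplied another way (for example through an AWS credentials file) would not be detected, so treat an empty result as a hint rather than a guarantee.

```python
import importlib.util
import os

# Names taken from the warnings above; adjust to the runner you actually use.
REQUIRED_MODULES = ("mrjob", "yaml", "boto3")
REQUIRED_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for module in missing_modules():
        print(f"missing module: {module}")
    for variable in missing_env_vars():
        print(f"missing environment variable: {variable}")
```

Running the script before `python job.py -r emr ...` gives an explicit message for each missing piece instead of relying on the job to report it.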
Imports
- MRJob
from mrjob.job import MRJob
- MRStep
from mrjob.step import MRStep
- HadoopJobRunner
from mrjob.hadoop import HadoopJobRunner
- EMRJobRunner
from mrjob.emr import EMRJobRunner
Quickstart
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()
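To see what the mapper, combiner, and reducer of the word-count job compute, the data flow can be imitated in plain Python without mrjob. This is only an illustrative sketch under simplifying assumptions: each input line is treated as its own map task, grouping is done in memory, and the function names (`group_by_key`, `simulate_word_count`) are made up for this example. It does not reflect how any mrjob runner schedules or shuffles data.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Emit (word, 1) for every word, as the job's mapper does.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def sum_counts(word, counts):
    # Same logic as both the combiner and the reducer in the job.
    yield word, sum(counts)


def group_by_key(pairs):
    # Stand-in for the shuffle/sort phase: collect values per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def simulate_word_count(lines):
    combined = []
    for line in lines:  # assumption: one map task per line
        for word, counts in group_by_key(mapper(line)):
            combined.extend(sum_counts(word, counts))  # combiner step
    totals = {}
    for word, counts in group_by_key(combined):
        for key, total in sum_counts(word, counts):  # reducer step
            totals[key] = total
    return totals


if __name__ == "__main__":
    print(simulate_word_count(["the cat saw the dog", "The dog ran"]))
    # {'cat': 1, 'dog': 2, 'ran': 1, 'saw': 1, 'the': 3}
```

Because the combiner only pre-sums counts that the reducer would sum anyway, removing the combiner step from the sketch gives the same totals; in a real job it mainly reduces the amount of data passed between the map and reduce phases.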