MS-COCO Caption Evaluation
pycocoevalcap provides Python 3 support for evaluating image captions with the standard MS-COCO metrics (BLEU, METEOR, ROUGE-L, CIDEr, and SPICE). It is derived from the original Python 2.7 coco-caption repository and depends on the COCO API (`pycocotools`). The latest release is 1.2, from November 2020; updates since have been infrequent and maintenance-oriented.
Warnings
- gotcha Java 1.8.0 is a mandatory runtime dependency for SPICE, METEOR, and the PTBTokenizer. Ensure `java` is installed and on your PATH; otherwise evaluation can fail with `subprocess.CalledProcessError` or other Java-related errors (`java.lang.UnsatisfiedLinkError` has also been reported).
- gotcha The `pycocotools` dependency can be challenging to install, especially on Windows or if specific C/C++ build tools are missing. It often requires a C compiler (e.g., build tools for Visual Studio on Windows, or `gcc` on Linux/macOS) to compile its C extensions.
- gotcha SPICE (Semantic Propositional Image Caption Evaluation) automatically downloads Stanford CoreNLP models on its first run. This process can fail due to network issues, incorrect permissions for caching, or Java environment problems, leading to 'Could not cache item for SPICE' or similar errors.
- gotcha The CIDEr metric, due to its TF-IDF weighting, may return a score of 0 when evaluating only a single ground-truth/prediction pair. It is designed for corpus-level evaluation, where document frequencies are computed across many images.
- gotcha Users have reported issues with METEOR score computation, sometimes leading to `subprocess.CalledProcessError` or incorrect score aggregation, possibly due to changes in the underlying Java METEOR implementation or inconsistencies in data processing.
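The CIDEr zero-score behaviour follows directly from its IDF term: with a "corpus" of one image, every n-gram's document frequency equals the corpus size, so all IDF weights vanish and the TF-IDF vectors are zero. A pure-Python toy sketch of that IDF computation (not the library's implementation):

```python
import math

def idf_weights(corpus_ngrams):
    """Compute IDF per n-gram over a list of per-image n-gram sets,
    in the CIDEr style: idf(g) = log(N / df(g))."""
    n = len(corpus_ngrams)
    doc_freq = {}
    for grams in corpus_ngrams:
        for g in set(grams):
            doc_freq[g] = doc_freq.get(g, 0) + 1
    return {g: math.log(n / df) for g, df in doc_freq.items()}

# One-image corpus: every n-gram appears in the only document,
# so df == N and every IDF weight is log(1) == 0.
single = idf_weights([{"a", "man", "rides"}])
print(single)  # all weights are 0 -> CIDEr's TF-IDF vectors are zero

# With two images, n-grams unique to one image get a positive weight.
multi = idf_weights([{"a", "man"}, {"two", "dogs"}])
```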
Install
pip install pycocoevalcap
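Since SPICE, METEOR, and the PTBTokenizer shell out to Java, it is worth verifying the runtime right after installing. A minimal stdlib-only check (the helper name `java_available` is ours):

```python
import shutil
import subprocess

def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs."""
    exe = shutil.which("java")
    if exe is None:
        return False
    try:
        # `java -version` exits 0 on a working JRE (output goes to stderr)
        subprocess.run([exe, "-version"], capture_output=True, check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        return False

if not java_available():
    print("Install a Java 1.8 runtime before running SPICE/METEOR.")
```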
Imports
- COCOEvalCap
from pycocoevalcap.eval import COCOEvalCap
- Bleu
from pycocoevalcap.bleu.bleu import Bleu
- Meteor
from pycocoevalcap.meteor.meteor import Meteor
- Rouge
from pycocoevalcap.rouge.rouge import Rouge
- Cider
from pycocoevalcap.cider.cider import Cider
- Spice
from pycocoevalcap.spice.spice import Spice
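The individual scorers can also be used directly, bypassing COCOEvalCap. Each `compute_score` call takes two dicts that map an image id to a list of caption strings (normally lowercased and pre-tokenized; COCOEvalCap runs the PTBTokenizer for you) and returns `(corpus_score, per_image_scores)`. A sketch with the import guarded so it degrades gracefully when the package is not installed:

```python
# Ground truths: image id -> list of reference captions.
gts = {
    1: ["a man is riding a bicycle", "a person on a bike on a street"],
    2: ["two dogs playing in the grass", "dogs are running on a lawn"],
}
# Results: image id -> single-element list with the predicted caption.
res = {
    1: ["a man cycling on a road"],
    2: ["two puppies in a field"],
}

try:
    from pycocoevalcap.bleu.bleu import Bleu
    score, scores = Bleu(4).compute_score(gts, res)
    print("BLEU-1..4:", score)  # list of four corpus-level scores
except ImportError:
    print("pycocoevalcap is not installed; showing the input format only.")
```

The same `gts`/`res` shape works for the other scorers (Meteor, Rouge, Cider, Spice), though the Java-backed ones additionally require a working JRE.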
Quickstart
import json
from pycocoevalcap.eval import COCOEvalCap
# Mock ground truth and predicted captions data
# In a real scenario, these would be loaded from JSON files
# 'gts' should map image_id to a list of ground truth captions
# 'res' should map image_id to a list of predicted captions
gts_data = {
    "annotations": [
        {"image_id": 1, "id": 101, "caption": "A man is riding a bicycle."},
        {"image_id": 1, "id": 102, "caption": "A person on a bike on a street."},
        {"image_id": 2, "id": 201, "caption": "Two dogs playing in the grass."},
        {"image_id": 2, "id": 202, "caption": "Dogs are running on a lawn."}
    ]
}
res_data = [
    {"image_id": 1, "caption": "A man cycling on a road.", "id": 301},
    {"image_id": 2, "caption": "Two puppies in a field.", "id": 302}
]
# To initialize COCOEvalCap, you need COCO objects for ground truth and results.
# These COCO objects are typically created from JSON files matching the COCO format.
# For a quickstart, we'll manually structure the data to match expected input.
# The COCO object expects a dictionary with 'images' and 'annotations' keys.
# We only need 'annotations' for caption evaluation.
# Mock COCO objects (simplified for quickstart, actual COCO objects handle more fields)
class MockCoco:
    def __init__(self, data):
        self.anns = {ann['id']: ann for ann in data.get('annotations', [])}
        self.imgToAnns = {}
        for ann in data.get('annotations', []):
            self.imgToAnns.setdefault(ann['image_id'], []).append(ann)

    def loadRes(self, res_json_or_list):
        # For simplicity, just store the results; the real COCO.loadRes does more.
        res_anns = []
        for r in res_json_or_list:
            # Assign a unique id if not present, similar to COCO API behavior
            if 'id' not in r:
                r['id'] = max(self.anns.keys(), default=0) + len(res_anns) + 1
            res_anns.append(r)
        return MockCoco({'annotations': res_anns})

    def getImgIds(self):
        return list(self.imgToAnns.keys())

    def loadAnns(self, ids):
        return [self.anns[i] for i in ids]
# Initialize Mock COCO objects
# gts_coco_obj = COCO(gts_json_path) # In a real application
gts_coco_obj = MockCoco(gts_data)
# res_coco_obj = gts_coco_obj.loadRes(res_json_path) # In a real application
res_coco_obj = gts_coco_obj.loadRes(res_data)
# COCOEvalCap takes exactly two arguments: the ground-truth and result COCO objects.
# It reads the image ids to evaluate from the ground-truth object itself.
cocoEval = COCOEvalCap(gts_coco_obj, res_coco_obj)
# Optionally restrict evaluation to a subset of image ids:
# cocoEval.params['image_id'] = gts_coco_obj.getImgIds()
cocoEval.evaluate()
print("Evaluation results:")
for metric, score in cocoEval.eval.items():
print(f"{metric}: {score:.3f}")
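After `evaluate()` finishes, `cocoEval.eval` holds the corpus-level scores and `cocoEval.imgToEval` the per-image breakdown; both are plain dicts and serialize directly. A sketch of persisting the scores (the dict below is an illustrative stand-in for a real `cocoEval.eval`, with made-up values):

```python
import json

# Stand-in for cocoEval.eval after a real run (values are illustrative only).
eval_scores = {"Bleu_1": 0.62, "Bleu_4": 0.21, "METEOR": 0.24,
               "ROUGE_L": 0.47, "CIDEr": 0.85, "SPICE": 0.18}

with open("caption_metrics.json", "w") as f:
    json.dump(eval_scores, f, indent=2)

# Reload to confirm the round trip survived serialization.
with open("caption_metrics.json") as f:
    reloaded = json.load(f)
```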