HuggingFace Math-Verify

0.9.0 · active · verified Sat Apr 11

Math-Verify is a Python library from HuggingFace, currently at version 0.9.0, for evaluating Large Language Model outputs on mathematical tasks. It parses and verifies mathematical expressions written in LaTeX or as plain numerical expressions, and supports set theory, equation and inequality comparison, and answer normalization. The goal is a more accurate assessment of LLM performance on math problems than evaluators that depend on strict output formats and inflexible comparison logic. The project is actively developed and released.

Warnings

Install

Imports

Quickstart

This quickstart uses `parse` to extract mathematical expressions from strings (both LaTeX and plain expressions) and `verify` to check them for mathematical equivalence. It passes explicit `ExtractionConfig` instances to `parse` and shows that comparing intervals with inequalities is asymmetric by default: the gold answer is the first argument to `verify`, and swapping the arguments can change the result.

from math_verify import parse, verify, LatexExtractionConfig, ExprExtractionConfig

# Define extraction configurations
extraction_configs = [LatexExtractionConfig(), ExprExtractionConfig()]

# Parse the gold standard answer (e.g., from a dataset)
# Raw string so that \cup reaches the parser as LaTeX instead of an invalid Python escape
gold_answer_text = r"${1,3} \cup {2,4}$"
gold_parsed = parse(gold_answer_text, extraction_config=extraction_configs)

# Parse the LLM generated answer
llm_answer_text = "${1,2,3,4}$"
llm_parsed = parse(llm_answer_text, extraction_config=extraction_configs)

# Verify if the LLM's answer is mathematically equivalent to the gold standard
is_correct = verify(gold_parsed, llm_parsed)

print(f"Gold: {gold_answer_text} -> {gold_parsed}")
print(f"LLM: {llm_answer_text} -> {llm_parsed}")
print(f"Are answers equivalent? {is_correct}")

# Another example with an inequality and asymmetric comparison behavior
gold_ineq = parse("1 < x < 2")
llm_interval = parse("(1,2)")
print(f"\nGold (inequality): {gold_ineq}")
print(f"LLM (interval): {llm_interval}")
print(f"Are they equivalent (default)? {verify(gold_ineq, llm_interval)}")

# Reversed roles (gold is an interval, prediction is an inequality):
# pass allow_set_relation_comp=True to allow the comparison in this direction too
gold_interval = parse("(1,2)")
llm_ineq = parse("1 < x < 2")
print(f"\nGold (interval): {gold_interval}")
print(f"LLM (inequality): {llm_ineq}")
print(f"Are they equivalent (default)? {verify(gold_interval, llm_ineq)}")
print(f"Are they equivalent (allow_set_relation_comp=True)? {verify(gold_interval, llm_ineq, allow_set_relation_comp=True)}")
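
The same `parse` and `verify` calls can be wrapped in a small scoring loop when grading many answers at once. The sketch below is illustrative only: `score_answers` and its `parse_fn` / `verify_fn` parameters are hypothetical names, not part of Math-Verify, and it assumes that `parse` returns an empty result when nothing can be extracted. The library is imported lazily so the helper can also be tried with stand-in functions.

```python
from typing import Callable, Iterable, Optional, Tuple


def score_answers(
    pairs: Iterable[Tuple[str, str]],
    parse_fn: Optional[Callable] = None,
    verify_fn: Optional[Callable] = None,
) -> dict:
    """Score (gold_text, prediction_text) pairs and return summary counts.

    Hypothetical helper for illustration; not part of Math-Verify.
    """
    if parse_fn is None or verify_fn is None:
        # Lazy import: only needed when no stand-in functions are supplied.
        from math_verify import parse, verify

        parse_fn = parse_fn or parse
        verify_fn = verify_fn or verify

    total = correct = unparsed = 0
    for gold_text, pred_text in pairs:
        total += 1
        gold = parse_fn(gold_text)
        pred = parse_fn(pred_text)
        # Assumption: an empty parse result means nothing was extracted.
        if not gold or not pred:
            unparsed += 1
            continue
        # Gold goes first, because the comparison is not symmetric by default.
        if verify_fn(gold, pred):
            correct += 1

    accuracy = correct / total if total else 0.0
    return {
        "total": total,
        "correct": correct,
        "unparsed": unparsed,
        "accuracy": accuracy,
    }
```

With Math-Verify installed, calling `score_answers(pairs)` on a list of `(gold_text, prediction_text)` tuples uses the library's own `parse` and `verify`. Counting unparsed pairs separately helps distinguish extraction failures from genuinely wrong answers.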
