HuggingFace Math-Verify
Math-Verify is a robust Python library from HuggingFace, currently at version 0.9.0, for evaluating Large Language Model outputs on mathematical tasks. It parses and verifies mathematical expressions in both LaTeX and plain numerical formats, and supports set theory, equation/inequality comparison, and advanced normalization. By moving beyond strict format requirements and inflexible comparison logic, it aims to assess LLM performance on math problems more accurately. The library is under active development with a regular release cadence.
Warnings
- breaking As of version 0.5.0, `math-verify` replaced the direct use of `sympy.FiniteSet` with `FiniteSet` from `latex2sympy2_extended.sets`. If your code directly interacted with `sympy.FiniteSet` objects in conjunction with `math-verify`'s internal set handling, this change might break compatibility or lead to unexpected behavior.
- deprecated The `equations` parameter in `NormalizationConfig` was deprecated in version 0.6.0. Its functionality is now handled internally by the parser.
- gotcha The `verify` function has an intentional asymmetric behavior when comparing interval-like expressions (e.g., `(1,2)`) and inequality-like expressions (e.g., `1 < x < 2`). By default, `verify` might return `True` for `1 < x < 2` (gold) vs. `(1,2)` (prediction), but `False` for `(1,2)` (gold) vs. `1 < x < 2` (prediction). This design prevents models from simply returning the input inequality without solving it.
- gotcha As of version 0.6.2, the parsing timeout mechanism was changed from being per-extraction to global. This means that a single long input with multiple embedded expressions might exhaust the global timeout, even if individual extractions would have completed within their own (now defunct) per-extraction limits.
- gotcha In version 0.8.0, the default logging verbosity was reduced, and internal errors are now logged at the debug level by default. This means you might not see parsing or verification errors in standard log outputs unless `raise_on_error` is set to `True` or logging is configured to show debug messages.
- gotcha The `verify` function's behavior for lists containing a mix of SymPy expressions and strings is optimized for inputs originating from the `parse` function. Directly constructing lists (e.g., `[sympy.Number(0), '0']`) and passing them to `verify` might lead to unexpected `False` results, especially when a SymPy expression on one side should logically match a string on the other.
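Given the reduced default verbosity noted above, you can opt back into verbose output with standard Python logging configuration. This is a minimal sketch; the logger name `math_verify` is an assumption and should be confirmed against your installed version:

```python
import logging

# As of 0.8.0, math-verify logs internal parsing/verification errors at
# DEBUG level, so they are invisible under the default WARNING threshold.
# Surface them globally:
logging.basicConfig(level=logging.DEBUG)

# Or target just the library's logger. The name "math_verify" is an
# assumption here -- inspect logging.root.manager.loggerDict after
# importing the library to confirm the actual logger name.
logging.getLogger("math_verify").setLevel(logging.DEBUG)
```

Alternatively, pass `raise_on_error=True` (per the gotcha above) if you prefer failures to surface as exceptions rather than log records.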
Install
- default
pip install math-verify
- with a pinned ANTLR4 runtime (quoted so the brackets survive the shell)
pip install "math-verify[antlr4_13_2]"
Imports
- parse
from math_verify import parse
- verify
from math_verify import verify
- LatexExtractionConfig
from math_verify import LatexExtractionConfig
- ExprExtractionConfig
from math_verify import ExprExtractionConfig
- StringExtractionConfig
from math_verify import StringExtractionConfig
Quickstart
from math_verify import parse, verify, LatexExtractionConfig, ExprExtractionConfig
# Define extraction configurations
extraction_configs = [LatexExtractionConfig(), ExprExtractionConfig()]
# Parse the gold standard answer (e.g., from a dataset)
gold_answer_text = r"$\{1,3\} \cup \{2,4\}$"  # raw string; set braces must be escaped in LaTeX
gold_parsed = parse(gold_answer_text, extraction_config=extraction_configs)
# Parse the LLM generated answer
llm_answer_text = r"$\{1,2,3,4\}$"
llm_parsed = parse(llm_answer_text, extraction_config=extraction_configs)
# Verify if the LLM's answer is mathematically equivalent to the gold standard
is_correct = verify(gold_parsed, llm_parsed)
print(f"Gold: {gold_answer_text} -> {gold_parsed}")
print(f"LLM: {llm_answer_text} -> {llm_parsed}")
print(f"Are answers equivalent? {is_correct}")
# Another example with an inequality and asymmetric comparison behavior
gold_ineq = parse("1 < x < 2")
llm_interval = parse("(1,2)")
print(f"\nGold (inequality): {gold_ineq}")
print(f"LLM (interval): {llm_interval}")
print(f"Are they equivalent (default)? {verify(gold_ineq, llm_interval)}")
# To allow symmetric comparison (e.g., if gold is interval and pred is inequality)
gold_interval = parse("(1,2)")
llm_ineq = parse("1 < x < 2")
print(f"\nGold (interval): {gold_interval}")
print(f"LLM (inequality): {llm_ineq}")
print(f"Are they equivalent (default)? {verify(gold_interval, llm_ineq)}")
print(f"Are they equivalent (allow_set_relation_comp=True)? {verify(gold_interval, llm_ineq, allow_set_relation_comp=True)}")
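In practice, Math-Verify is usually run over a whole evaluation set rather than a single pair. The following is a sketch of such a loop; `score_batch` is a hypothetical helper, and the parse/verify callables are injected so the example runs standalone (in real use you would pass `math_verify.parse` and `math_verify.verify`):

```python
from typing import Callable, List, Sequence

def score_batch(
    golds: Sequence[str],
    preds: Sequence[str],
    parse_fn: Callable[[str], List],
    verify_fn: Callable[[List, List], bool],
) -> float:
    """Fraction of predictions judged equivalent to their gold answers.

    parse_fn / verify_fn are injected; in real use pass
    math_verify.parse and math_verify.verify.
    """
    if not golds:
        return 0.0
    correct = 0
    for gold_text, pred_text in zip(golds, preds):
        gold = parse_fn(gold_text)  # parse returns [] when nothing extracts
        pred = parse_fn(pred_text)
        # verify is order-sensitive: gold first, prediction second
        # (see the interval/inequality asymmetry gotcha above).
        if gold and pred and verify_fn(gold, pred):
            correct += 1
    return correct / len(golds)

# Stand-in callables for demonstration only (plain string equality,
# not mathematical equivalence).
demo_parse = lambda s: [s.strip().strip("$")]
demo_verify = lambda g, p: g == p

print(score_batch(["$2$", "$3$"], ["2", "4"], demo_parse, demo_verify))  # 0.5
```

Keeping gold and prediction in a fixed argument order matters here: swapping them can flip the result for interval-vs-inequality pairs, as shown in the Quickstart.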