SPARK Parser Toolkit
SPARK is a lightweight, pure-Python parser toolkit based on the Earley algorithm. It lets developers build scanners and parsers for custom languages or data formats, with grammar rules defined in Python method docstrings. The current version is 1.9.0; releases occur periodically to address Python compatibility and improve internal mechanics.
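The docstring convention works because Python exposes method docstrings at runtime: a toolkit can walk a class and harvest each `p_*` method's docstring as a grammar rule. A minimal sketch of that pattern (the `CalcRules` class and `collect_rules` helper below are illustrative only, not `spark_parser`'s actual internals):

```python
# Illustrative sketch of the docstring-rule convention.
# NOT spark_parser's real implementation.

class CalcRules:
    def p_expr_add(self, args):
        '''expr ::= expr + term'''

    def p_expr_term(self, args):
        '''expr ::= term'''

def collect_rules(cls):
    """Return the docstring of every method whose name starts with 'p_'."""
    rules = []
    for name in sorted(dir(cls)):
        if name.startswith("p_"):
            doc = getattr(cls, name).__doc__
            if doc:
                rules.append(doc.strip())
    return rules

print(collect_rules(CalcRules))
# ['expr ::= expr + term', 'expr ::= term']
```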
Common errors
- `ModuleNotFoundError: No module named 'spark'`
  cause: Attempting to import from `spark` instead of `spark_parser`.
  fix: Change your import statements from `from spark import ...` to `from spark_parser import ...`.
- `RecursionError: maximum recursion depth exceeded`
  cause: Prior to version 1.9.0, `spark-parser`'s internal `BuildTree` used recursion, which could hit Python's recursion limit on large or deeply nested parse trees. Complex grammars can also trigger this indirectly.
  fix: Upgrade to `spark-parser` 1.9.0 or higher. If the issue persists with a very complex grammar, review it for excessively deep or recursive rules.
- `ValueError: Invalid character: '<char>'`
  cause: Your `GenericScanner` subclass encountered a character in the input string that it has no rule to tokenize, often due to missing whitespace handling, unhandled special characters, or malformed input.
  fix: Review your scanner's `tokenize` method. Ensure it handles every character the input can contain, including whitespace, numbers, symbols, and any other valid tokens, and either add rules for unknown characters or skip them explicitly.
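The last fix can be sketched as a standalone tokenizer, independent of `spark_parser`: handle every expected character class explicitly, and decide up front whether unknown characters raise or are skipped. The `tokenize` function and its `strict` flag below are illustrative, not part of the library's API:

```python
# Standalone sketch of a defensive tokenize loop (no spark_parser needed).
# 'strict' controls whether unknown characters raise or are silently skipped.
def tokenize(src, strict=True):
    tokens = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1                      # whitespace: consume, emit nothing
        elif ch.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1                  # greedily consume the whole number
            tokens.append(('NUMBER', int(src[i:j])))
            i = j
        elif ch in "+-*/()":
            tokens.append((ch, ch))     # single-character operator tokens
            i += 1
        elif strict:
            raise ValueError(f"Invalid character: {ch!r}")
        else:
            i += 1                      # non-strict: skip unknown input
    return tokens
```

With `strict=False`, `tokenize("1 $ 2")` yields the same tokens as `tokenize("1 2")`; with the default `strict=True`, it raises `ValueError`.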
Warnings
- gotcha Starting with version 1.9.0, the internal `BuildTree` mechanism was rewritten from recursive to iterative. While this fixes `RecursionError` for large trees, it's a significant internal change that might subtly affect highly specialized code directly interacting with `BuildTree`'s structure or performance characteristics.
- gotcha SPARK parser's compatibility with very old Python versions (e.g., <3.7) may be inconsistent or require specific `spark-parser` versions. Recent releases (1.9.0+) focus on a 'modern Python style' (e.g., type annotations, `pyproject.toml`).
- gotcha Incorrectly defined grammar rules (e.g., ambiguity, infinite recursion, unreachable productions) can lead to unexpected parsing results, `SyntaxError` exceptions, or infinite loops.
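The recursive-to-iterative rewrite mentioned in the first gotcha follows a standard pattern: replace the call stack with an explicit loop (or stack) so that tree depth is bounded by memory rather than by the interpreter's recursion limit. A generic sketch of the idea, using hypothetical helpers (`make_chain`, `depth_recursive`, `depth_iterative`) rather than `spark-parser`'s actual `BuildTree` code:

```python
import sys

# A deeply right-nested "parse tree" built from ('node', child) pairs.
def make_chain(depth):
    node = ('leaf', None)
    for _ in range(depth):
        node = ('node', node)
    return node

def depth_recursive(tree):
    # Naive recursion: raises RecursionError once the chain
    # exceeds the interpreter's recursion limit.
    if tree[1] is None:
        return 1
    return 1 + depth_recursive(tree[1])

def depth_iterative(tree):
    # Same traversal with an explicit loop: depth is limited
    # only by available memory.
    count = 0
    while tree is not None:
        count += 1
        tree = tree[1]
    return count

deep = make_chain(sys.getrecursionlimit() * 2)
depth_iterative(deep)      # fine at any depth
# depth_recursive(deep)    # would raise RecursionError
```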
Install
pip install spark-parser
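After installing, you can verify that the environment has a version with the 1.9.0 `BuildTree` fix. A small helper sketch; the `at_least` function is hypothetical and naively assumes plain `X.Y.Z` version strings (no pre-release suffixes):

```python
from importlib.metadata import version  # stdlib, Python 3.8+

def at_least(installed, minimum=(1, 9, 0)):
    """True if a plain 'X.Y.Z' version string is >= `minimum`.
    Naive comparison; pre-release suffixes are not handled."""
    return tuple(int(p) for p in installed.split(".")[:3]) >= minimum

# Usage (requires spark-parser to be installed):
# at_least(version("spark-parser"))
```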
Imports
- GenericParser
from spark import GenericParser  # old module name; raises ModuleNotFoundError
from spark_parser import GenericParser
- GenericScanner
from spark_parser import GenericScanner
Quickstart
from spark_parser import GenericParser, GenericScanner

# Tokens compare equal to their type string; this is how SPARK matches
# terminals such as NUMBER and '+' in grammar rules against scanner output.
class Token:
    def __init__(self, kind, attr=None):
        self.kind = kind
        self.attr = attr
    def __eq__(self, other):
        return self.kind == other
    def __hash__(self):
        return hash(self.kind)
    def __repr__(self):
        return self.kind

# 1. Define your scanner (lexer) by subclassing GenericScanner
class SimpleCalcScanner(GenericScanner):
    def tokenize(self, input_string):
        tokens = []
        i = 0
        while i < len(input_string):
            char = input_string[i]
            if char.isspace():
                i += 1
                continue
            if char.isdigit():
                num_str = ""
                while i < len(input_string) and input_string[i].isdigit():
                    num_str += input_string[i]
                    i += 1
                tokens.append(Token('NUMBER', int(num_str)))
            elif char in "+-*/()":
                tokens.append(Token(char, char))
                i += 1
            else:
                raise ValueError(f"Invalid character: {char}")
        return tokens

# 2. Define your parser (grammar rules) by subclassing GenericParser
class SimpleCalcParser(GenericParser):
    def __init__(self, start_symbol='expr'):
        GenericParser.__init__(self, start_symbol)

    # Grammar rules live in the docstrings of methods named 'p_*';
    # each method body is the semantic action for its rule.
    def p_expr_add(self, args):
        '''
        expr ::= expr + term
        '''
        return args[0] + args[2]

    def p_expr_term(self, args):
        '''
        expr ::= term
        '''
        return args[0]

    def p_term_num(self, args):
        '''
        term ::= NUMBER
        '''
        return args[0].attr  # the token's numeric value

# 3. Instantiate scanner and parser, then tokenize and parse
scanner = SimpleCalcScanner()
parser = SimpleCalcParser()
text_to_parse = "10 + 5"
tokens = scanner.tokenize(text_to_parse)
result = parser.parse(tokens)
print(f"Parsed result for '{text_to_parse}': {result}")  # Expected: 15
assert result == 15, "Parsing failed!"