Tree-sitter Regex
tree-sitter-regex provides the Python bindings for the Tree-sitter regex grammar, enabling high-performance parsing of regular expressions into concrete syntax trees. It allows developers to analyze, transform, and understand regex patterns programmatically. The current version is 0.25.0, with an active but somewhat irregular release cadence.
Warnings
- gotcha In current releases, `language()` returns a handle to the pre-compiled grammar rather than a ready-made `tree_sitter.Language` object; wrap it as `tree_sitter.Language(language())` before handing it to a parser. Do not try to compile or load the 'regex' grammar yourself with `tree_sitter.Language.build_library`, which recent versions of the `tree_sitter` package no longer provide.
- gotcha The `tree_sitter.Parser.parse()` method strictly expects a `bytes` object as input, not a Python `str`. Passing a string directly will result in a `TypeError`.
- gotcha When accessing node text (e.g., `node.text`), the returned value is always a `bytes` object. For human-readable output or string manipulation, this `bytes` object must be decoded.
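Because both the parser input and node text are bytes, offsets such as a node's `start_byte` and `end_byte` count bytes, not characters. The following is a minimal standard-library sketch of why slices should be taken from the encoded source and then decoded. The pattern and the offsets are illustrative assumptions computed by hand, not output from the parser.

```python
# Illustrative pattern with a non-ASCII character ("é" is 2 bytes in UTF-8).
pattern = r"café\d+"
source = pattern.encode("utf8")  # the bytes form that Parser.parse() expects

# Hypothetical byte offsets of the `\d+` portion, in the form a node's
# start_byte/end_byte would take (computed by hand for this sketch).
start_byte = source.index(rb"\d")
end_byte = len(source)

# Correct: slice the bytes, then decode for display.
from_bytes = source[start_byte:end_byte].decode("utf8")

# Incorrect: applying byte offsets to the str shifts the slice by one character.
from_str = pattern[start_byte:end_byte]

print(from_bytes)  # \d+
print(from_str)    # d+
```

For pure-ASCII patterns the two slices agree, which is why this mistake tends to surface only once a pattern contains multi-byte characters.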
Install
pip install tree-sitter-regex
Imports
- language
from tree_sitter_regex import language
Quickstart
import tree_sitter
from tree_sitter_regex import language
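# Wrap the grammar handle returned by language() in a tree_sitter.Language
REGEX_LANGUAGE = tree_sitter.Language(language())
# Create a parser for that language (recent tree_sitter releases take it in the
# constructor; the older parser.set_language() call is no longer available)
parser = tree_sitter.Parser(REGEX_LANGUAGE)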
# Define a regex string to parse (must be bytes)
regex_string = r"^([a-zA-Z0-9_\-]+)\s*=\s*(.+)$"
encoded_regex = bytes(regex_string, "utf8")
# Parse the regex string
tree = parser.parse(encoded_regex)
# Print the S-expression representation of the syntax tree
print("--- S-expression Tree ---")
print(str(tree.root_node))  # str() replaces the older node.sexp() method
# Traverse and print some nodes
print("\n--- Node Details ---")
root = tree.root_node
for child in root.children:
    print(f"Type: {child.type}, Text: {child.text.decode('utf8')}, Start: {child.start_point}, End: {child.end_point}")
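# Example: find every node of a given type anywhere in the tree.
# Rules whose names start with an underscore (such as `_token_name`) are hidden
# by Tree-sitter and never appear as nodes, so search for a visible type instead.
# TARGET_TYPE is an assumed node name; check the S-expression output for the real ones.
TARGET_TYPE = "character_class"
print(f"\n--- {TARGET_TYPE} Nodes Found ---")
stack = [root]
while stack:
    node = stack.pop()
    if node.type == TARGET_TYPE:
        print(f"Found {TARGET_TYPE}: {node.text.decode('utf8')}")
    # Push children in reverse so they are visited in source order
    stack.extend(reversed(node.children))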
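The S-expression is printed on a single line, which gets hard to read for longer patterns. As a rough sketch, an indented outline can be built from just the `type`, `text`, and `children` attributes of each node. `FakeNode` and `outline` are hypothetical names introduced here so the example runs without the grammar installed, and the node types in the demo tree are placeholders rather than names from the regex grammar.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in exposing only the attributes used below."""
    type: str
    text: bytes
    children: list = field(default_factory=list)


def outline(node, depth=0):
    """Return one indented line per node, visiting the tree depth-first."""
    lines = [f"{'  ' * depth}{node.type}: {node.text.decode('utf8')}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hand-built tree for illustration only; the type names are placeholders.
demo = FakeNode("root", b"a+", [FakeNode("repeat", b"a+", [FakeNode("char", b"a")])])
print("\n".join(outline(demo)))
```

Since `outline` relies only on those three attributes, passing `tree.root_node` from the quickstart in place of `demo` should work the same way.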