Tree-sitter XML & DTD Grammars
tree-sitter-xml provides pre-compiled Tree-sitter grammars for XML and DTD. It enables fast, robust parsing of XML and DTD documents within Python applications by integrating with the `tree-sitter` library. The current version is 0.7.0, with updates typically coinciding with upstream Tree-sitter grammar improvements or core library changes.
Warnings
- gotcha `tree-sitter-xml` does not parse anything by itself. It exposes `language_xml()` and `language_dtd()`, which return the pre-compiled grammars; each must be wrapped in `tree_sitter.Language` and handed to `tree_sitter.Parser` for actual parsing. Both packages must be installed.
- breaking Updates to the underlying `tree-sitter` library, especially major versions of its Python bindings, can change the API used to attach a grammar to a parser. For example, recent releases construct the parser as `Parser(language)` and render a tree with `str(node)`, whereas older code used `Parser.set_language()` and `Node.sexp()`.
- gotcha The XML and DTD grammars are distinct and not interchangeable. Use `tree_sitter_xml.language_xml()` for XML documents and `tree_sitter_xml.language_dtd()` for DTD files.
- gotcha Pre-compiled grammars (`.so`, `.dylib`, or `.dll` files) are platform-specific. While `tree-sitter-xml` aims to distribute compatible binaries, issues can arise in unusual environments (e.g., exotic OS, specific Python distributions) or if custom grammar compilation is attempted.
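When one of the warnings above bites, the first thing worth checking is which versions of the two packages are actually installed. A minimal sketch using only the standard library follows; the helper names `installed_version` and `grammar_environment` are made up for this illustration and are not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def grammar_environment():
    """Map each required distribution name to its installed version (or None)."""
    required = ("tree-sitter", "tree-sitter-xml")
    return {name: installed_version(name) for name in required}


if __name__ == "__main__":
    # Print one line per package so the pair can be pasted into a bug report.
    for name, found in grammar_environment().items():
        print(f"{name}: {found or 'not installed'}")
```

A `None` entry means the distribution is missing from the active environment, which is the usual cause of an import error when only one of the two packages was installed.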
Install
- pip install tree-sitter-xml
- pip install tree-sitter
Imports
- language_xml
from tree_sitter_xml import language_xml
- language_dtd
from tree_sitter_xml import language_dtd
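Because the package ships two separate grammars, calling code has to decide which one a given file needs before building a parser. One possible way to make that decision is by file suffix, sketched below. The suffix sets and the `grammar_kind` helper are assumptions for illustration, not something the package provides.

```python
from pathlib import Path

# Hypothetical suffix tables for illustration; extend them to match your project.
XML_SUFFIXES = {".xml", ".xsd", ".xsl", ".svg"}
DTD_SUFFIXES = {".dtd"}


def grammar_kind(path):
    """Return "xml" or "dtd" for a file path, or None if the suffix is unknown."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    if suffix in DTD_SUFFIXES:
        return "dtd"
    if suffix in XML_SUFFIXES:
        return "xml"
    return None
```

The returned string can then select which of the two grammar accessors to import and call, while a `None` result lets the caller skip or report files it does not recognize.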
Quickstart
from tree_sitter import Language, Parser
from tree_sitter_xml import language_xml

# Wrap the pre-compiled XML grammar in a Language object
XML_LANGUAGE = Language(language_xml())

# Initialize the parser with that language
parser = Parser(XML_LANGUAGE)

# Sample XML string
xml_code = """
<root>
  <item id="1">Value 1</item>
  <item id="2">Value 2</item>
</root>
"""

# Parse the XML (the parser expects bytes, not str)
tree = parser.parse(xml_code.encode("utf8"))

# Print the S-expression (a common way to inspect the parse tree)
print(f"Parsed XML Tree S-expression:\n{str(tree.root_node)}")

# The 'item' elements are nested below the root element, so walk the
# whole tree instead of looking only at the document node's direct children.
def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

item_nodes = [
    node for node in walk(tree.root_node)
    if node.type == "element" and node.text.decode("utf8").startswith("<item")
]
print(f"\nFound {len(item_nodes)} 'item' elements.")
if item_nodes:
    print(f"First item's text: {item_nodes[0].text.decode('utf8')}")
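Tree-sitter still returns a tree when the input is malformed, marking the problem spots with `ERROR` nodes or zero-width "missing" nodes. A minimal sketch of collecting those nodes follows. It relies only on the `.type`, `.children`, and `.is_missing` attributes that `tree_sitter.Node` exposes, so a hypothetical `StubNode` stand-in is used here to make the example runnable without a grammar installed. The node types in the stub tree are illustrative.

```python
from collections import namedtuple


def find_problem_nodes(node):
    """Return every ERROR or missing node at or below `node`, in document order."""
    problems = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.type == "ERROR" or current.is_missing:
            problems.append(current)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(current.children))
    return problems


# Hypothetical stand-in for tree_sitter.Node, used only for this demo.
StubNode = namedtuple("StubNode", ["type", "children", "is_missing"])

demo_tree = StubNode("document", [
    StubNode("element", [
        StubNode("ERROR", [], False),  # unparseable span
        StubNode("ETag", [], True),    # closing tag the parser had to assume
    ], False),
], False)

demo_problems = find_problem_nodes(demo_tree)
print([n.type for n in demo_problems])
```

With a real parse result, the same helper would be called as `find_problem_nodes(tree.root_node)`. When only a yes/no answer is needed, the node's `has_error` property is a cheaper check.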