Zhang-Shasha: Tree Edit Distance in Python
ZSS is a Python library that implements the Zhang-Shasha algorithm for computing the tree edit distance between two ordered labeled trees. It is currently at version 1.2.0 and, while not frequently updated, provides a stable and specialized tool for comparing tree structures. The library can be extended with custom node formats and distance metrics, and optionally leverages `editdist` and `numpy` for enhanced functionality and performance.
Common errors
-
ModuleNotFoundError: No module named 'zss'
cause The 'zss' package is not installed in the current Python environment, or the environment where it's installed is not active.fixEnsure the library is installed: `pip install zss`. If using a virtual environment, activate it before running your script. -
TypeError: object of type 'MyCustomNode' has no len()
cause When using a custom node class instead of `zss.Node`, you haven't provided `get_children`, `get_label`, and/or `label_dist` functions to `zss.simple_distance` or `zss.distance`. The library tries to access attributes or methods it expects by default from `zss.Node`.fixDefine functions to extract children and labels from your custom nodes, and a function to compare their labels. Then pass these to the distance calculation: `simple_distance(A, B, get_children=my_get_children_func, get_label=my_get_label_func, label_dist=my_label_distance_func)`. -
AttributeError: 'Node' object has no attribute 'add_child'
cause The `zss.Node` class uses `addkid()` to add children, not a more generic `add_child()` or `append()` that might be common in other tree implementations.fixWhen building trees with `zss.Node`, use the `addkid()` method: `Node('parent').addkid(Node('child'))`.
Warnings
- gotcha The `simple_distance` function assumes default costs for node insertion, removal, and updates (based on label equality). If your application requires specific, non-uniform, or asymmetric costs, you must use the more general `zss.distance` function and provide custom `insert_cost`, `remove_cost`, and `update_cost` functions.
- gotcha By default, `zss` only knows how to handle nodes with string labels. When using custom tree structures or non-string labels, you must provide custom `get_children`, `get_label`, and `label_dist` functions to `simple_distance` or `distance`.
- gotcha The `editdist` and `numpy` packages are optional 'soft requirements'. Without `editdist`, label comparisons default to a simple equality check (0 if equal, 1 if not). Without `numpy`, performance can be significantly slower, especially for large trees.
Install
-
pip install zss -
pip install zss editdist numpy
Imports
- simple_distance
from zss import simple_distance
- Node
from zss import Node
- distance
from zss import simple_distance (for complex cost models)
from zss import distance
Quickstart
from zss import simple_distance, Node
A = (
Node("f")
.addkid(Node("a")
.addkid(Node("h"))
.addkid(Node("c")
.addkid(Node("l"))
)
)
.addkid(Node("e"))
)
B = (
Node("f")
.addkid(Node("a")
.addkid(Node("d"))
.addkid(Node("c")
.addkid(Node("b"))
)
)
.addkid(Node("e"))
)
distance = simple_distance(A, B)
print(f"Tree edit distance: {distance}")
# Expected: 2