DAWG-Python
Pure-Python reader for DAWGs (Directed Acyclic Word Graphs, also known as Deterministic Acyclic Finite State Automata). It loads and queries existing DAWG files created by the dawgdic C++ library or the DAWG Python C extension. Access is read-only: the package does not build DAWGs itself, so files must be produced with one of those tools. The current version is 0.7.2, and releases are infrequent.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: 'my_dictionary.dawg'
  cause: The DAWG file you are trying to load does not exist at the specified path, or the path is incorrect.
  fix: Double-check the file path. Ensure the `.dawg` file exists and is accessible from where your script is running. Use an absolute path if unsure.
- TypeError: 'str' object cannot be interpreted as an integer
  cause: You are attempting to build an `IntDAWG` (with the DAWG C extension) from payloads that are not integers (e.g., strings, floats).
  fix: Convert all payloads to integers before constructing the `IntDAWG`. For example: `IntDAWG([(word, int(value)) for word, value in data])`.
- AttributeError: 'DAWG' object has no attribute 'add_word'
  cause: You are attempting to modify a `DAWG` object after it has been created or loaded. DAWG structures are immutable.
  fix: To 'update' a DAWG, build a completely new DAWG from the desired set of words and load that instead. There is no in-place modification.
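The path check suggested for the `FileNotFoundError` case can be sketched as a small helper. This is a minimal sketch using only the standard library; `resolve_dawg_path` is a hypothetical name, not part of DAWG-Python, and the commented usage assumes the `DAWG().load(path)` call shown in the Quickstart.

```python
from pathlib import Path


def resolve_dawg_path(path):
    """Return an absolute path to an existing DAWG file.

    Hypothetical helper (not part of DAWG-Python): it fails early with a
    message that includes the resolved path and the current directory.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(
            f"DAWG file not found: {resolved} (current directory: {Path.cwd()})"
        )
    return str(resolved)


# Typical use (requires DAWG-Python and a real file on disk):
#   import dawg_python
#   d = dawg_python.DAWG().load(resolve_dawg_path('my_dictionary.dawg'))
```

Resolving to an absolute path before loading makes the error message independent of the directory the script happens to run from.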
Warnings
- gotcha The `dawg-python` library is a *reader* for DAWG files and does not construct them. To build a DAWG, especially for a very large dictionary, use the C++ `dawgdic` library or the `DAWG` Python C extension, then load the resulting file with `dawg-python`.
- gotcha DAWG objects (both `DAWG` and `IntDAWG`) are immutable once created or loaded from a file. You cannot add, remove, or modify words/payloads in place.
- gotcha When building a DAWG with the `DAWG` C extension using `DAWG(iterable_of_words)` or `IntDAWG(iterable_of_tuples)`, pass input that is already *sorted* for best performance. If it is not sorted, the library sorts it internally, which can be slow for large inputs.
- gotcha `IntDAWG` is specifically designed for string keys with *integer* payloads. Passing non-integer values as payloads will result in a `TypeError`.
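The sorting and integer-payload gotchas above can be handled in one preparation step before the data is handed to a builder. The sketch below is plain Python with no DAWG dependency; `prepare_int_items` is a hypothetical helper, and sorting by UTF-8 encoded key is an assumption about the order a builder expects (for ASCII keys it matches alphabetical order).

```python
def prepare_int_items(pairs):
    """Coerce payloads to int and return (key, value) pairs sorted by key.

    Hypothetical helper: raises ValueError with the offending key instead of
    letting a builder fail later with a less specific TypeError.
    """
    items = []
    for key, value in pairs:
        try:
            number = int(value)
        except (TypeError, ValueError, OverflowError) as exc:
            raise ValueError(
                f"payload for {key!r} is not an integer: {value!r}"
            ) from exc
        # int() silently truncates floats such as 1.5, so reject those.
        if isinstance(value, float) and number != value:
            raise ValueError(
                f"payload for {key!r} would lose precision: {value!r}"
            )
        items.append((key, number))
    # Assumption: the builder wants keys in UTF-8 byte order.
    items.sort(key=lambda item: item[0].encode("utf8"))
    return items


print(prepare_int_items([("world", "20"), ("hello", 10)]))
# prints [('hello', 10), ('world', 20)]
```

The returned list can then be passed to whichever tool builds the `IntDAWG` file.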
Install
-
pip install dawg-python
Imports
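- DAWG
from dawg_python import DAWG  # DAWG-Python (this package)
from dawg import DAWG  # DAWG C extension (a separate package, not installed by dawg-python)
- IntDAWG
from dawg_python import IntDAWG  # DAWG-Python (this package)
from dawg import IntDAWG  # DAWG C extension (a separate package, not installed by dawg-python)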
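Because the two module names above belong to different packages, code that should work with either one can try them in order. The following is a minimal sketch using the standard `importlib`; `import_first_available` is a hypothetical helper, and preferring the C extension over the pure-Python reader is an assumption about what most projects want, not a requirement of either package.

```python
import importlib


def import_first_available(module_names):
    """Return the first importable module from module_names, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the C extension if present, otherwise fall back to the pure-Python reader.
dawg_backend = import_first_available(("dawg", "dawg_python"))
if dawg_backend is None:
    print("Neither 'dawg' nor 'dawg_python' is installed")
else:
    print(f"Using DAWG backend: {dawg_backend.__name__}")
```

Code that only loads and queries files can then use `dawg_backend.DAWG` without caring which package provided it. Anything that builds or saves a DAWG still needs the C extension.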
Quickstart
import os
import dawg          # DAWG C extension (pip install DAWG), used here only to build sample files
import dawg_python   # pure-Python reader (pip install dawg-python)
# 1. Create a sample DAWG file (in a real scenario, this might come from dawgdic)
# CompletionDAWG is used because prefix queries with keys() need completion data.
words_to_build = ['apple', 'apricot', 'banana', 'cat', 'dog']
# For large sets, words should be pre-sorted for performance.
dawg_file_path = 'sample_data.dawg'
dawg.CompletionDAWG(words_to_build).save(dawg_file_path)
# 2. Load the DAWG from a file (primary use case)
loaded_dawg = dawg_python.CompletionDAWG().load(dawg_file_path)
# 3. Query the loaded DAWG
print(f"Is 'apple' in DAWG? {'apple' in loaded_dawg}")
print(f"Words starting with 'a': {list(loaded_dawg.keys('a'))}")
# prefixes() lists the stored keys that are prefixes of the given string
print(f"Keys that are prefixes of 'apricot': {loaded_dawg.prefixes('apricot')}")
# Clean up the temporary file
os.remove(dawg_file_path)
# Example with IntDAWG for words with integer payloads
int_data = [('hello', 10), ('world', 20)]
int_file_path = 'sample_ints.dawg'
dawg.IntDAWG(int_data).save(int_file_path)
int_dawg_obj = dawg_python.IntDAWG().load(int_file_path)
print(f"Value for 'world': {int_dawg_obj['world']}")
os.remove(int_file_path)
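To make the two prefix-style queries concrete, here is a pure-Python stand-in over a plain word list. It only mimics the kind of result a completion query (`keys(prefix)`) and a prefixes-of-key query return. It does not reflect how a DAWG stores or searches data, and the function names are hypothetical.

```python
def keys_with_prefix(words, prefix=""):
    """Stored words that start with the given prefix (completion query)."""
    return [word for word in sorted(words) if word.startswith(prefix)]


def prefixes_of(words, key):
    """Stored words that are themselves prefixes of key, shortest first."""
    return [word for word in sorted(words, key=len) if key.startswith(word)]


sample_words = ['app', 'apple', 'apricot', 'banana', 'cat', 'dog']
print(keys_with_prefix(sample_words, 'ap'))   # prints ['app', 'apple', 'apricot']
print(prefixes_of(sample_words, 'apple'))     # prints ['app', 'apple']
```

The two queries run in opposite directions: the first extends a short string to longer stored words, the second finds the stored words contained at the start of a longer string.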