English Grapheme To Phoneme Conversion
g2p-en is a Python module that converts English graphemes (spelling) to phonemes (pronunciation), a common preprocessing step for tasks like speech synthesis. It combines three mechanisms: dictionary lookups for known words, part-of-speech tagging to disambiguate homographs, and a neural network for out-of-vocabulary words (inference runs on NumPy as of v2.0). The current version is 2.1.0, released in late 2019; releases appear to be infrequent.
Warnings
- breaking Version 2.0 removed TensorFlow as a dependency, replacing it with NumPy for neural network inference. Users upgrading from pre-2.0 versions will no longer require TensorFlow.
- gotcha The library requires downloading specific NLTK data files ('averaged_perceptron_tagger' and 'cmudict') after installation. This step is crucial for the library's functionality and is not performed automatically by `pip install`.
- gotcha The library uses part-of-speech tagging to disambiguate homographs (words spelled the same but pronounced differently, like 'refuse' as a verb vs. a noun), but correct disambiguation is not guaranteed in every context.
- gotcha For words not present in its internal dictionaries (Out-Of-Vocabulary words), g2p-en uses a neural network to predict pronunciations. While it makes a 'best guess', accuracy may vary for highly novel, specialized, or irregularly spelled terms.
- gotcha The library performs internal text preprocessing, including spelling out numbers ($250 -> two hundred fifty dollars), expanding common abbreviations (e.g. -> for example), and normalizing contractions (I'm -> I am). This can alter the input text before G2P conversion.
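The lookup order described in the homograph and out-of-vocabulary warnings above can be sketched as a simple cascade. This is a minimal illustration, not g2p-en's implementation: the tables `HOMOGRAPHS` and `LEXICON`, the helper `predict_oov`, and the function `pronounce` are all hypothetical names, and the pronunciations are illustrative ARPAbet entries rather than data copied from the library.

```python
# Toy tables for illustration only; g2p-en ships its own data files.
HOMOGRAPHS = {
    # word -> pronunciation when tagged as a verb, and the fallback otherwise
    "refuse": {
        "verb": ["R", "IH0", "F", "Y", "UW1", "Z"],
        "other": ["R", "EH1", "F", "Y", "UW2", "Z"],
    },
}
LEXICON = {
    "collect": ["K", "AH0", "L", "EH1", "K", "T"],
}


def predict_oov(word):
    """Hypothetical stand-in for the neural model: flags the word as unresolved."""
    return ["<oov:%s>" % word]


def pronounce(word, pos_tag):
    """Resolve a word in three steps: homograph table, lexicon, then fallback."""
    word = word.lower()
    if word in HOMOGRAPHS:
        # Penn Treebank verb tags start with "V" (VB, VBP, VBZ, ...).
        key = "verb" if pos_tag.startswith("V") else "other"
        return HOMOGRAPHS[word][key]
    if word in LEXICON:
        return LEXICON[word]
    return predict_oov(word)
```

The point of the sketch is that a wrong part-of-speech tag selects the wrong homograph entry, and any word missing from both tables depends entirely on the quality of the fallback predictor.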
Install
- pip install g2p-en
- python -m nltk.downloader "averaged_perceptron_tagger" "cmudict"
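The NLTK data can also be fetched from Python instead of the command line. The sketch below checks for each resource and downloads only what is missing. The helper name `ensure_nltk_data` and its injectable `find` and `download` parameters are hypothetical conveniences, and the resource paths are assumed to follow NLTK's usual `taggers/` and `corpora/` layout.

```python
# (resource path under nltk_data, package name passed to the downloader)
RESOURCES = [
    ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger"),
    ("corpora/cmudict", "cmudict"),
]


def ensure_nltk_data(resources=RESOURCES, find=None, download=None):
    """Download each NLTK resource that is not already installed.

    `find` and `download` default to nltk.data.find and nltk.download;
    they are parameters only so the logic can be exercised without network access.
    Returns the list of package names that had to be downloaded.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be defined without NLTK

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources:
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched
```

Calling `ensure_nltk_data()` once before constructing `G2p()` keeps the setup step inside the application, which can be convenient in containers or CI jobs.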
Imports
- G2p
from g2p_en import G2p
Quickstart
from g2p_en import G2p

texts = [
    "I have $250 in my pocket.",  # number -> spell-out
    "popular pets, e.g. cats and dogs",  # e.g. -> for example
    "I refuse to collect the refuse around here.",  # homograph
    "I'm an activationist."  # newly coined word
]

g2p = G2p()
for text in texts:
    out = g2p(text)
    print(out)
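Downstream code often needs the phonemes grouped by word or without stress marks. The helpers below are a minimal sketch that assumes the output is a flat list of ARPAbet symbols with stress digits on vowels and a `' '` element between words. That format is an assumption not stated above, so check it against the printed output first. `split_words` and `strip_stress` are hypothetical names, and `sample` is hand-written rather than captured from the library.

```python
import re


def split_words(phonemes):
    """Group a flat phoneme list into per-word lists, treating ' ' as the separator."""
    words, current = [], []
    for symbol in phonemes:
        if symbol == " ":
            if current:
                words.append(current)
                current = []
        else:
            current.append(symbol)
    if current:
        words.append(current)
    return words


def strip_stress(phonemes):
    """Remove a trailing stress digit (0, 1 or 2) from each symbol, if present."""
    return [re.sub(r"\d$", "", symbol) for symbol in phonemes]


# Hand-written sample in the assumed format, standing in for g2p("I have").
sample = ["AY1", " ", "HH", "AE1", "V"]
words = split_words(sample)
unstressed = [strip_stress(word) for word in words]
```

With a real `G2p` instance, `split_words(g2p(text))` would replace the hand-written sample.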