Indonesian G2P (Grapheme-to-Phoneme)

0.4.2 verified Sat May 09 auth: no python

A library for converting Indonesian text to phoneme sequences using a hybrid approach: rule-based conversion enhanced with an ONNX-based neural model. Current version 0.4.2, with active development on GitHub.

pip install g2p-id-py
error ImportError: cannot import name 'IndonesianG2P' from 'g2p_id'
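cause The g2p_id module was found but does not export IndonesianG2P. Typical reasons: a local file or directory named g2p_id shadows the installed package, the installed version does not provide that name, or the import path is wrong (e.g., 'from g2p_id import g2p_id'). If the package were not installed at all, Python would raise ModuleNotFoundError instead.
fix
Install or upgrade: pip install -U g2p-id-py. Then import: from g2p_id import IndonesianG2P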
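To tell a missing installation apart from a wrong import path, it can help to check what is actually installed. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of g2p-id-py.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "g2p-id-py" is the distribution name used with pip; the import name is g2p_id.
found = installed_version("g2p-id-py")
if found is None:
    print("g2p-id-py is not installed; run: pip install g2p-id-py")
else:
    print(f"g2p-id-py {found} is installed; check for a local g2p_id.py shadowing it")
```

If the distribution is reported as installed and the import still fails, printing g2p_id.__file__ after a plain "import g2p_id" shows which file Python actually loaded.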
error LookupError: Resource punkt not found. Please use the NLTK Downloader to obtain the resource:
cause The NLTK punkt tokenizer data is missing from the local NLTK data directory; the package's NLTK-based tokenization looks it up at runtime.
fix
Download it once from Python: import nltk; nltk.download('punkt')
error TypeError: __init__() got an unexpected keyword argument 'model_path'
cause Older versions of IndonesianG2P accepted a 'model_path' parameter; later versions removed or renamed it.
fix
Check the installed version with pip show g2p-id-py. Use the default initialization, IndonesianG2P(), or consult the documentation for the current constructor signature.
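Code that has to run against several releases can inspect a constructor's signature before passing optional arguments. This is a minimal sketch that assumes nothing about the real constructor: accepted_kwargs is a hypothetical helper, and OldStyle and NewStyle are stand-in classes, not the library's API.

```python
import inspect

# Parameter kinds that can be supplied by keyword.
_KEYWORD_KINDS = (
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    inspect.Parameter.KEYWORD_ONLY,
)


def accepted_kwargs(callable_obj, **candidates):
    """Keep only the keyword arguments that callable_obj's signature accepts."""
    params = inspect.signature(callable_obj).parameters
    # A **kwargs parameter means every keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    return {
        name: value
        for name, value in candidates.items()
        if name in params and params[name].kind in _KEYWORD_KINDS
    }


class OldStyle:  # stand-in for a release that still takes model_path
    def __init__(self, model_path=None):
        self.model_path = model_path


class NewStyle:  # stand-in for a release without model_path
    def __init__(self):
        pass


old = OldStyle(**accepted_kwargs(OldStyle, model_path="model.onnx"))
new = NewStyle(**accepted_kwargs(NewStyle, model_path="model.onnx"))
```

With the real class the call would look like IndonesianG2P(**accepted_kwargs(IndonesianG2P, model_path=...)). Note that silently dropping an argument means the default model is used, which may not be what the caller intended.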
breaking In v0.4.2, a glottal stop is inserted between consecutive vowels (e.g., 'hai' -> ['h', 'a', 'ʔ', 'i']), so output differs from earlier versions.
fix If you rely on the old behavior, pin to a version below 0.4.2 (pip install 'g2p-id-py<0.4.2') or adjust your phoneme post-processing.
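One possible post-processing step is to drop any glottal stop that sits directly between two vowels. This is a sketch only: the function name and the vowel set are assumptions made for illustration, and the filter cannot tell an inserted stop from one that an earlier version would also have produced.

```python
GLOTTAL_STOP = "ʔ"
# Assumed vowel inventory for illustration; adjust to the phoneme set you use.
VOWELS = {"a", "i", "u", "e", "o", "ə"}


def drop_intervocalic_glottal_stops(phonemes):
    """Remove 'ʔ' wherever it sits directly between two vowels."""
    result = []
    for i, ph in enumerate(phonemes):
        between_vowels = (
            ph == GLOTTAL_STOP
            and 0 < i < len(phonemes) - 1
            and phonemes[i - 1] in VOWELS
            and phonemes[i + 1] in VOWELS
        )
        if not between_vowels:
            result.append(ph)
    return result


print(drop_intervocalic_glottal_stops(["h", "a", "ʔ", "i"]))
# prints ['h', 'a', 'i']
```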
breaking In v0.4.2, every 'k' grapheme maps to the phoneme 'k'. Previously, some were mapped to 'ʔ'. This may affect downstream tasks such as ASR.
fix Review your expected phoneme sequences and update any mappings that assumed 'ʔ' for 'k'.
gotcha The package depends on NLTK's TweetTokenizer. Since v0.3.5 the NLTK version is pinned because NLTK >=3.8.1 is not backward compatible. A different NLTK version in the same environment may break the package.
fix Install the pinned version: pip install 'nltk==3.8', or see issue #16.
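A quick way to see whether an environment falls into the affected range is to compare the NLTK release number against 3.8.1. The sketch below is illustrative: parse_release and nltk_needs_pin are hypothetical helpers, the threshold is taken from the note above, and the parser only handles plain dotted release strings.

```python
import re

# First NLTK release reported as incompatible in the note above (assumption).
FIRST_INCOMPATIBLE = (3, 8, 1)


def parse_release(version_str):
    """Turn a release string such as '3.8.1' into a tuple of ints."""
    numbers = []
    for piece in version_str.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def nltk_needs_pin(version_str):
    """Return True if this NLTK version is in the reported incompatible range."""
    return parse_release(version_str) >= FIRST_INCOMPATIBLE


for candidate in ("3.8", "3.8.1", "3.9.1"):
    print(candidate, nltk_needs_pin(candidate))
# prints:
# 3.8 False
# 3.8.1 True
# 3.9.1 True
```

In practice the version string would come from importlib.metadata.version("nltk") rather than a hard-coded list.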
gotcha The ONNX model is loaded with ONNX Runtime. To serialize an IndonesianG2P object (e.g., with pickle), use v0.3.7 or later, where the ONNX InferenceSession is wrapped to allow this.
fix Upgrade to >=0.3.7, or handle serialization manually.
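Handling serialization manually usually means pickling only the model path and rebuilding the session after loading. The following is a generic sketch of that pattern, not the wrapper the library itself uses; PicklableSession and its loader argument are hypothetical names. With ONNX Runtime, the loader would typically be onnxruntime.InferenceSession.

```python
class PicklableSession:
    """Holds a non-picklable session; only the path and loader are serialized."""

    def __init__(self, model_path, loader):
        self.model_path = model_path
        self._loader = loader  # callable that builds a session from a path
        self._session = None   # created lazily on first access

    @property
    def session(self):
        # Build the session the first time it is needed, then reuse it.
        if self._session is None:
            self._session = self._loader(self.model_path)
        return self._session

    def __getstate__(self):
        # Copy the attributes but leave the live session out of the pickle.
        state = self.__dict__.copy()
        state["_session"] = None
        return state

    def __setstate__(self, state):
        # Restore attributes; the session is rebuilt on the next access.
        self.__dict__.update(state)
```

For pickle.dumps to work, the loader itself must be picklable, for example a module-level function or a class rather than a lambda.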

Instantiate IndonesianG2P and call g2p on a string to get a list of phonemes.

from g2p_id import IndonesianG2P

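# Create the converter with its default settings.
g2p = IndonesianG2P()
text = "Halo, apa kabar?"
# Convert the text to a list of phonemes.
phonemes = g2p.g2p(text)
print(phonemes)
# Example output (exact phonemes may differ between versions):
# ['h', 'a', 'l', 'o', 'ʔ', 'a', 'p', 'a', 'k', 'a', 'b', 'a', 'r']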