Delimiter Detector
The `detect-delimiter` Python library (version 0.1.1, last released in July 2018) provides a single function, `detect()`, that identifies the delimiter used in ad-hoc delimited text such as CSV or TSV. It works primarily by counting character frequencies in an input string, which makes it suitable for basic delimiter detection needs. With no release since July 2018, the library is best treated as stable but not actively developed.
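The frequency-counting idea can be illustrated with a minimal sketch. This is not the library's source code: the function name `naive_detect` and its tie-breaking behaviour are assumptions made for illustration, and the candidate list simply copies the default `whitelist` quoted elsewhere on this page.

```python
from collections import Counter

# Candidate characters, copied from the default whitelist quoted in this document.
CANDIDATES = [',', ';', ':', '|', '\t']


def naive_detect(text, candidates=CANDIDATES, default=None):
    """Return the most frequent candidate character in text, or default.

    Illustrative sketch only; not the detect-delimiter implementation.
    """
    counts = Counter(ch for ch in text if ch in candidates)
    if not counts:
        return default
    # most_common(1) yields [(char, count)] for the highest count.
    return counts.most_common(1)[0][0]


print(repr(naive_detect("apple,banana,cherry")))
print(repr(naive_detect("name\tage\tcity")))
print(repr(naive_detect("hello world", default="NA")))
```

A counter of this kind has no notion of fields or quoting, which is why purely frequency-based detection is fast but easy to mislead.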
Common errors
- detect_delimiter doesn't consider quotation and escaping, so it can easily miss the correct separator when another candidate character occurs more often inside quoted or escaped fields.
  cause The `detect()` function performs a basic character frequency count without understanding CSV format rules such as quoted fields. If a common delimiter like a comma appears frequently within quoted text, it may be incorrectly identified as the primary delimiter.
  fix For files known to follow CSV standards with quoting, use Python's `csv.Sniffer`. If a simplified approach is still desired and the problem persists, pass the `whitelist` parameter with only the *true* expected delimiters, e.g., `detect(text, whitelist=[';'])`.
- `None` returned as delimiter when an expected delimiter is clearly present in the text.
  cause The expected delimiter might not be in the default `whitelist` `[',', ';', ':', '|', '\t']`, or it might be a character that is blacklisted by default (e.g., alphanumeric characters and the period).
  fix Explicitly provide a `whitelist` parameter with the characters you expect to be delimiters (e.g., `detect(text, whitelist=['|', '~'])`), or adjust the `blacklist` if characters are being incorrectly ignored (e.g., `detect(text, blacklist=[])`).
- Incorrect delimiter detected (e.g., returns ',' but the file is ';'-separated).
  cause The library's frequency-based detection can be misled if a character that is *not* the true delimiter appears more often in the sample text. This is common when data fields in a semicolon-delimited file contain frequent commas.
  fix Narrow down the possibilities using the `whitelist` parameter, for example, `detect(text, whitelist=[';', '|'])`. For highly ambiguous cases, manual inspection or a more context-aware parsing library might be needed.
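The `csv.Sniffer` fix mentioned above can be sketched with the standard library alone. The sample data is invented for illustration. It is a semicolon-separated file whose quoted fields contain more commas than there are semicolons, so a plain frequency count (shown here with `collections.Counter`, not with `detect()` itself) favours the wrong character.

```python
import csv
import io
from collections import Counter

# Invented sample: ';' separates fields, ',' appears only inside quoted fields.
sample = (
    'name;comment\n'
    '"Smith, J.";"likes a, b, c"\n'
    '"Doe, A.";"x, y, z"\n'
)

# Plain frequency counting: commas inside quoted fields outnumber semicolons.
counts = Counter(ch for ch in sample if ch in ",;")
naive_guess = counts.most_common(1)[0][0]

# csv.Sniffer takes quoting into account; `delimiters` restricts the candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
rows = list(csv.reader(io.StringIO(sample), dialect))

print("frequency count picks:", repr(naive_guess))
print("csv.Sniffer picks:", repr(dialect.delimiter))
print(rows[1])
```

`csv.Sniffer` is itself heuristic and raises `csv.Error` when it cannot determine a dialect, so wrapping the `sniff()` call in a `try`/`except csv.Error` block is a reasonable precaution on untrusted input.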
Warnings
- gotcha The `detect()` function, by default, will not check alphanumeric characters or the period/full stop character ('.') as delimiters. If your files use these as actual delimiters (e.g., a custom file format with `.` as a separator), they will be ignored.
- gotcha The library does not handle CSV quoting rules (e.g., delimiters within double quotes `"field, with, commas"`). It primarily relies on simple character frequency counting. This can lead to incorrect delimiter detection in malformed CSVs or when data fields contain characters that are also common delimiters.
- gotcha The `detect-delimiter` library is designed for single-character delimiters and does not support multi-character delimiters (e.g., `##`, `|||`).
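Because multi-character delimiters are out of scope for `detect()`, they need separate handling. One possible workaround, sketched here with a hypothetical helper `pick_multichar_delimiter` that is not part of `detect-delimiter`, is to count non-overlapping occurrences of each known candidate string with `str.count`. It assumes the set of possible separators is known in advance.

```python
def pick_multichar_delimiter(text, candidates, default=None):
    """Return the candidate string that occurs most often in text.

    Hypothetical helper, not part of detect-delimiter. Ties go to the
    candidate listed first; default is returned when none occurs.
    """
    best, best_count = default, 0
    for candidate in candidates:
        count = text.count(candidate)  # non-overlapping occurrences
        if count > best_count:
            best, best_count = candidate, count
    return best


line = "alpha##beta##gamma"
delimiter = pick_multichar_delimiter(line, ["|||", "##"])
print(repr(delimiter))
print(line.split(delimiter))
```

Like the library itself, this sketch ignores quoting, so the same caveats about delimiters inside quoted fields apply.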
Install
- pip install detect-delimiter
Imports
- detect
from detect_delimiter import detect
Quickstart
from detect_delimiter import detect
# Example 1: Basic comma-separated data
text1 = "apple,banana,cherry"
delimiter1 = detect(text1)
print(f"Delimiter for '{text1}': '{delimiter1}'")
# Example 2: Tab-separated data
text2 = "name\tage\tcity"
delimiter2 = detect(text2)
print(f"Delimiter for '{text2}': '{delimiter2}'")
# Example 3: Semicolon-separated with a custom whitelist
text3 = "one;two;three"
delimiter3 = detect(text3, whitelist=[';', ',', '|'])
print(f"Delimiter for '{text3}': '{delimiter3}'")
# Example 4: No common delimiter found, returning a default value
text4 = "hello world"
delimiter4 = detect(text4, default='NA')
print(f"Delimiter for '{text4}': '{delimiter4}'")
# Example 5: Period as delimiter, which is blacklisted by default
text5 = "file.name.txt"
delimiter5 = detect(text5)
print(f"Delimiter for '{text5}': '{delimiter5}'") # Expected: None (as '.' is blacklisted by default)