cyvcf2: Fast VCF Parsing with Cython + HTSlib
cyvcf2 is a Python library providing a fast Cython wrapper for HTSlib, specifically designed for efficient parsing, querying, and limited modification of VCF (Variant Call Format) and BCF files. It offers a Python-friendly interface to access genetic variation data, supporting quick iteration through variants, extraction of diverse variant attributes, and manipulation of INFO and FORMAT fields. The library is highly optimized for performance, making it suitable for processing large genomic datasets. [1, 3, 4, 8]
Common errors
- UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 1: ordinal not in range(128)
  cause: The VCF contains non-ASCII characters in a field that cyvcf2 decodes as ASCII. This is more likely with older Python 3 environments or certain system locales, and it can also result from corrupted or malformed VCF records. [12]
  fix: Confirm that the file is properly encoded and look for non-standard characters in it. On older Python versions, check that the locale settings are correct. In some older cyvcf2 versions the `v.INFO` keys may be bytes under Python 3 and need explicit decoding (e.g., `key.decode('utf-8')`).
- ImportError: cannot import name 'VCF' from 'cyvcf2'
  cause: Either cyvcf2 was not installed correctly, or a local file or directory named `cyvcf2.py` or `cyvcf2` on your Python path shadows the real package. A failed installation caused by missing C dependencies (such as HTSlib) produces the same error.
  fix: Verify the installation with `pip list | grep cyvcf2`. Look for local files or directories named `cyvcf2.py` or `cyvcf2` that could shadow the installed package. If compiling from source, make sure HTSlib and its development headers are available on your system.
- Can't install from source / compile errors (e.g., on Windows with Python 3.7+)
  cause: Building on Windows, especially for Python 3.7 and above, requires the Visual C++ Build Tools (MSVC v14.0 or newer), and the Cython-generated code can hit compatibility issues caused by changes in Python's internal APIs. [14]
  fix: Install the required Visual C++ Build Tools (e.g., the 'Desktop development with C++' workload in the Visual Studio Installer). Consider using `conda` on Windows, since `bioconda` often provides pre-compiled binaries that avoid local compilation. Alternatively, use a Linux environment or WSL.
- PackagesNotFoundError: The following packages are not available from current channels: - cyvcf2 (when using `conda install`)
  cause: The default conda channels do not contain `cyvcf2`. It is primarily hosted on the `bioconda` channel. [16]
  fix: Add the bioconda channel with `conda config --add channels bioconda`, make sure `conda-forge` is also enabled with `conda config --add channels conda-forge`, and then run `conda install cyvcf2`.
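The checks suggested for the first two errors can be partly automated with the standard library alone. The sketch below is illustrative and not part of cyvcf2: `find_non_ascii_lines` and `module_origin` are hypothetical helper names. The first reports lines of a plain or gzip-compressed VCF that contain bytes outside ASCII; the second reports the file Python would load a module from, which exposes a local `cyvcf2.py` or `cyvcf2` directory shadowing the installed package.

```python
import gzip
import importlib.util


def find_non_ascii_lines(path, limit=10):
    """Return up to `limit` (line_number, raw_bytes) pairs that contain non-ASCII bytes."""
    # gzip and bgzip files both start with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open

    hits = []
    with opener(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            if not raw.isascii():
                hits.append((line_number, raw.rstrip(b"\r\n")))
                if len(hits) >= limit:
                    break
    return hits


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

Calling `module_origin("cyvcf2")` from the directory where the import fails shows whether the path points into the environment's installed packages or at a local file.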
Warnings
- breaking HTSlib version requirements changed at cyvcf2 0.20.0. cyvcf2 versions < 0.20.0 require htslib < 1.10, while cyvcf2 versions >= 0.20.0 require htslib >= 1.10. Installing with an incompatible htslib version will lead to build or runtime errors. [3]
- gotcha Numpy arrays returned by `variant.gt_types`, `variant.gt_ref_depths`, etc., are views into the underlying C data structure. These arrays become invalid (containing 'nonsense' data) once the `variant` object goes out of scope. [3]
- gotcha cyvcf2 does not support writing VCFs with UTF-8 encoded, non-ASCII characters in string-typed FORMAT fields, nor does it support writing string type FORMAT fields with `Number` greater than 1. [1, 15]
- gotcha By default, cyvcf2 classifies partially missing genotypes (e.g., `0/.`, `./1`) as heterozygous (HET). Other tools may instead treat such genotypes as missing (UNKNOWN), so genotype counts can differ between pipelines. [1]
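To make the last warning concrete, here is a toy model of the two policies for partially missing genotypes. It is plain Python written for illustration and is not cyvcf2's implementation: `classify_gt` is a hypothetical name, it handles only diploid genotype strings, and the integer codes mirror cyvcf2's default `gt_types` encoding (0=HOM_REF, 1=HET, 2=UNKNOWN, 3=HOM_ALT). As far as the cyvcf2 documentation describes, passing `strict_gt=True` to `VCF(...)` selects the stricter behavior; verify this against your installed version.

```python
HOM_REF, HET, UNKNOWN, HOM_ALT = 0, 1, 2, 3  # default gt_types codes


def classify_gt(gt, strict=False):
    """Classify a diploid genotype string such as '0/1', '1|1' or '0/.'.

    strict=False models the default described above (partially missing -> HET).
    strict=True models the alternative (any missing allele -> UNKNOWN).
    """
    alleles = gt.replace("|", "/").split("/")
    called = [allele for allele in alleles if allele != "."]

    if not called:
        return UNKNOWN  # fully missing, e.g. './.'
    if len(called) < len(alleles):
        return UNKNOWN if strict else HET  # partially missing, e.g. '0/.'
    if all(allele == "0" for allele in called):
        return HOM_REF
    if len(set(called)) == 1:
        return HOM_ALT
    return HET


for gt in ("0/0", "0/1", "1|1", "0/.", "./1", "./."):
    print(gt, classify_gt(gt), classify_gt(gt, strict=True))
```

Only the partially missing genotypes change between the two columns, which is where results from differently configured tools can diverge.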
Install
- pip install cyvcf2
- conda install -c bioconda cyvcf2
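When building against a system HTSlib, the version rule from the Warnings section (cyvcf2 < 0.20.0 with htslib < 1.10, cyvcf2 >= 0.20.0 with htslib >= 1.10) can be checked before installing. The sketch below only encodes that rule; `htslib_matches_cyvcf2` is a hypothetical helper, and it assumes plain dotted numeric version strings such as `1.10.2`.

```python
def major_minor(version):
    """Turn '1.10.2' into (1, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split(".")[:2])


def htslib_matches_cyvcf2(cyvcf2_version, htslib_version):
    """Return True when the pair satisfies the rule stated in the Warnings section."""
    new_cyvcf2 = major_minor(cyvcf2_version) >= (0, 20)
    new_htslib = major_minor(htslib_version) >= (1, 10)
    # Both must be on the same side of the boundary.
    return new_cyvcf2 == new_htslib
```

Parsing into integer tuples matters here because a plain string comparison would rank `1.9` above `1.10`.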
Imports
- VCF
from cyvcf2 import VCF
- Writer
from cyvcf2 import VCF, Writer
Quickstart
import os
from cyvcf2 import VCF

# Create a dummy VCF file for demonstration if it doesn't exist
vcf_path = 'example.vcf'
if not os.path.exists(vcf_path):
    with open(vcf_path, 'w') as f:
        f.write('##fileformat=VCFv4.2\n')
        # Contig header lines use the ##contig key
        f.write('##contig=<ID=1,length=10000>\n')
        f.write('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n')
        f.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
        f.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2\n')
        f.write('1\t100\trs1\tA\tT\t50\tPASS\tDP=100\tGT\t0/1\t1/1\n')
        f.write('1\t200\trs2\tC\tG,T\t90\tPASS\tDP=150\tGT\t0/0\t0/1\n')

try:
    vcf = VCF(vcf_path)
    for variant in vcf:
        print(f"CHROM: {variant.CHROM}, POS: {variant.POS}, REF: {variant.REF}, ALT: {variant.ALT}")
        print(f"  ID: {variant.ID}, QUAL: {variant.QUAL}, FILTER: {variant.FILTER}")
        print(f"  INFO DP: {variant.INFO.get('DP')}")
        # gt_types: 0=HOM_REF, 1=HET, 2=UNKNOWN, 3=HOM_ALT
        print(f"  Genotypes (types): {variant.gt_types}")
        # Depth arrays rely on per-sample depth fields (e.g. AD), which this
        # dummy file lacks, so do not expect real depths here.
        print(f"  Reference depths: {variant.gt_ref_depths}")
        print(f"  Alternate depths: {variant.gt_alt_depths}")
    vcf.close()
except Exception as e:
    print(f"Error processing VCF: {e}")
    print("Please ensure 'example.vcf' is a valid VCF file and indexed if doing region queries.")
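The quickstart prints each array while its variant is still in hand. Because the Warnings section notes that these arrays are views into C data, values that need to outlive the loop are safer copied first. A minimal sketch under that assumption: `collect_gt_types` and `load_gt_types` are hypothetical helpers, and the only array method used is `.copy()`, which for NumPy arrays returns an independent array.

```python
def collect_gt_types(variants):
    """Return a list of (chrom, pos, gt_types_copy) tuples, one per variant."""
    rows = []
    for variant in variants:
        # .copy() detaches the values from the variant's underlying buffer,
        # so the stored array stays usable after the loop moves on.
        rows.append((variant.CHROM, variant.POS, variant.gt_types.copy()))
    return rows


def load_gt_types(vcf_path):
    """Open a VCF with cyvcf2 and collect detached genotype-type arrays."""
    from cyvcf2 import VCF  # imported lazily; only this function needs cyvcf2

    vcf = VCF(vcf_path)
    try:
        return collect_gt_types(vcf)
    finally:
        vcf.close()
```

With the dummy file from the quickstart, `load_gt_types('example.vcf')` would return one tuple per record. The same copy-before-storing pattern applies to `gt_ref_depths` and `gt_alt_depths`.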