pyfaidx: Efficient FASTA Access
pyfaidx is a Python library that provides efficient, pythonic random access to subsequences within FASTA files, compatible with samtools index format (.fai). It allows for fast retrieval and in-place modification without loading the entire file into memory. The current version is 0.9.0.4, with frequent minor updates and bug fixes.
Warnings
- breaking A bug in the new BGZF indexing strategy introduced in v0.9.0 affected versions up to v0.9.0.2. Users of pyfaidx v0.9.0, v0.9.0.1, or v0.9.0.2 should upgrade to v0.9.0.3 or later to avoid potential indexing issues with BGZF compressed FASTA files.
- gotcha pyfaidx uses 1-based (closed) coordinates for sequence attributes like `.start` and `.end` on `Sequence` objects, mirroring samtools faidx. However, Python's native slicing (`sequence[start:end]`) remains 0-based and half-open. This can be a common source of off-by-one errors if not carefully managed. You can initialize `Fasta(..., one_based_attributes=False)` to change the `.start/.end` attributes, but it won't affect slicing behavior.
- gotcha FASTA files require consistent line lengths (apart from the last line of a sequence) for pyfaidx to correctly build an index and retrieve subsequences. Inconsistent line lengths can lead to errors during indexing or incorrect sequence retrieval.
- gotcha By default, pyfaidx truncates sequence descriptions when indexing to keep names concise. If you need to access the full FASTA header (including the description) for each sequence, you must initialize the `Fasta` object with `read_long_names=True`. This option only works with uncompressed FASTA files.
Install
-
pip install pyfaidx
Imports
- Fasta
from pyfaidx import Fasta
Quickstart
import os
from pyfaidx import Fasta
# Create a dummy FASTA file for demonstration
fasta_content = (
">chr1 description of chromosome 1\n"
"ATGCGTACGTACGTACGTAGCTAGCTAGCTACGTAGCTACGTAGCTAGCTACGTACGT\n"
"CGTAGCTACGTAGCTACGTAGCTACGTAGCTAGCTACGTAGCTACGTACGTAGCTACG\n"
">chr2 description of chromosome 2\n"
"GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGA\n"
"TTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATT\n"
)
file_path = "example.fasta"
with open(file_path, "w") as f:
f.write(fasta_content)
# Open the FASTA file, an index (.fai) will be created if it doesn't exist
genes = Fasta(file_path)
# Access a sequence by its header name (case-sensitive)
chr1_sequence = genes['chr1']
print(f"Chromosome 1 full length: {len(chr1_sequence)}")
print(f"First 10 bases of chr1: {chr1_sequence[:10]}") # Pythonic 0-based slicing
# Retrieve a subsequence using 1-based coordinates (like samtools faidx)
# pyfaidx object slicing is 0-based, but Sequence object attributes are 1-based.
# To get 1-based subsequence string, you'd typically use the string slice directly from the FastaRecord.
# The example below shows how to get a 1-based range (e.g. for printing a 1-based output)
# Note: Slicing `genes['chr1'][start_0_based:end_0_based]`
# For 1-based '21-30', it means Python slice `[20:30]`
sub_sequence_1_based = genes['chr1'][20:30] # This gets bases 21-30 (1-based)
print(f"chr1 1-based coord 21-30: {sub_sequence_1_based.seq}")
print(f" .start (1-based): {sub_sequence_1_based.start}")
print(f" .end (1-based): {sub_sequence_1_based.end}")
# Get the reverse complement of a sequence
rc_sequence = genes['chr2'][::-1].complement
print(f"Reverse complement of chr2 start: {rc_sequence.seq[:20]}")
# Clean up the dummy file and its index
os.remove(file_path)
os.remove(file_path + ".fai")