pyfaidx: Efficient FASTA Access

0.9.0.4 · active · verified Sat Apr 11

pyfaidx is a Python library that provides efficient, pythonic random access to subsequences within FASTA files, compatible with samtools index format (.fai). It allows for fast retrieval and in-place modification without loading the entire file into memory. The current version is 0.9.0.4, with frequent minor updates and bug fixes.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to initialize a `Fasta` object, access sequences by their header names, retrieve subsequences using slicing, and perform operations like reverse complementation. It also highlights the default 1-based indexing for sequence attributes, while slicing remains 0-based Pythonic.

import os
from pyfaidx import Fasta

# Create a dummy FASTA file for demonstration
fasta_content = (
    ">chr1 description of chromosome 1\n"
    "ATGCGTACGTACGTACGTAGCTAGCTAGCTACGTAGCTACGTAGCTAGCTACGTACGT\n"
    "CGTAGCTACGTAGCTACGTAGCTACGTAGCTAGCTACGTAGCTACGTACGTAGCTACG\n"
    ">chr2 description of chromosome 2\n"
    "GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGA\n"
    "TTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATT\n"
)

file_path = "example.fasta"
with open(file_path, "w") as f:
    f.write(fasta_content)

# Open the FASTA file, an index (.fai) will be created if it doesn't exist
genes = Fasta(file_path)

# Access a sequence by its header name (case-sensitive)
chr1_sequence = genes['chr1']
print(f"Chromosome 1 full length: {len(chr1_sequence)}")
print(f"First 10 bases of chr1: {chr1_sequence[:10]}") # Pythonic 0-based slicing

# Retrieve a subsequence using 1-based coordinates (like samtools faidx)
# pyfaidx object slicing is 0-based, but Sequence object attributes are 1-based.
# To get 1-based subsequence string, you'd typically use the string slice directly from the FastaRecord.
# The example below shows how to get a 1-based range (e.g. for printing a 1-based output)
# Note: Slicing `genes['chr1'][start_0_based:end_0_based]`
# For 1-based '21-30', it means Python slice `[20:30]`
sub_sequence_1_based = genes['chr1'][20:30] # This gets bases 21-30 (1-based)
print(f"chr1 1-based coord 21-30: {sub_sequence_1_based.seq}")
print(f"  .start (1-based): {sub_sequence_1_based.start}")
print(f"  .end (1-based): {sub_sequence_1_based.end}")

# Get the reverse complement of a sequence
rc_sequence = genes['chr2'][::-1].complement
print(f"Reverse complement of chr2 start: {rc_sequence.seq[:20]}")

# Clean up the dummy file and its index
os.remove(file_path)
os.remove(file_path + ".fai")

view raw JSON →