pyfastx

2.3.0 verified Sat May 09 auth: no python

pyfastx is a Python module for fast random access to sequences in plain and gzipped FASTA/Q files. It provides an efficient, low-memory interface for reading, indexing, and querying biological sequences. The current version is 2.3.0; releases are irregular, roughly every 6 to 12 months.

pip install pyfastx
error AttributeError: 'Fastx' object has no attribute 'seq_len'
cause The code reads the seq_len attribute, which was removed in v2.0.0.
fix Use len(seq) or seq.len instead.
error FileNotFoundError: [Errno 2] No such file or directory: 'example.fa.fxi'
cause pyfastx creates an index file (with the .fxi extension) and needs write permission in the same directory as the FASTA file. If that directory is read-only, building the index fails.
fix Use the index_file parameter to place the index at a writable path: pyfastx.Fasta('path/to/readonly/file.fa', index_file='/tmp/file.fa.fxi')
error ValueError: Sequence not found: chr1
cause The sequence ID used for random access does not match the header in the file. pyfastx uses the first word of the FASTA header (everything up to the first whitespace) as the ID.
fix Check the actual IDs with fa.keys() or by iterating over the file, then use the exact first word of the header (e.g., for '>chr1 some description', the ID is 'chr1').
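The ID rule described above (first word of the header, up to the first whitespace) can be sketched in plain Python. This only illustrates the rule and is not pyfastx's own parser; `header_id` and `list_ids` are hypothetical helper names, and the sample records are made up.

```python
def header_id(header_line):
    """Return the first whitespace-delimited word of a FASTA header line."""
    # Drop the leading '>' if present, then trim surrounding whitespace.
    text = header_line[1:] if header_line.startswith(">") else header_line
    text = text.strip()
    return text.split()[0] if text else ""


def list_ids(lines):
    """Collect the IDs of all header lines in an iterable of FASTA lines."""
    return [header_id(line) for line in lines if line.startswith(">")]


# Made-up records: one header with a description, one with a tab separator.
sample = [">chr1 some description", "ACGT", ">chr2\tsecond record", "GGCC"]
print(list_ids(sample))  # ['chr1', 'chr2']
```

Comparing these IDs with the key passed to fa[...] shows quickly whether a lookup fails because of a description suffix or a naming difference such as 'chr1' versus '1'.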
breaking In version 2.0.0, the default index file path changed. Indexes are now saved in a default location (e.g., ~/.pyfastx/index) unless another path is specified. This can break scripts that rely on custom index locations or that expect the index next to the input file.
fix Pass index_file to the Fasta or Fastq constructor to set the index path explicitly: pyfastx.Fasta('file.fa', index_file='./index/file.fa.fxi')
gotcha When iterating over a Fasta or Fastq object, breaking out of the loop early may cause errors in subsequent random access, because iteration and index-based access share internal state.
fix Do not mix iteration and random access. If you need both, create separate instances: one for iteration, one for indexed access.
deprecated The 'seq_len' attribute on Sequence objects was renamed to 'len' in version 2.0.0. Accessing 'seq_len' now raises AttributeError.
fix Use len(seq) or seq.len instead of seq.seq_len.
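For the index-location entries above, one option is to choose the index path explicitly before opening the file. The following is a minimal sketch using only the standard library for the path logic. `writable_index_path` and `open_fasta` are hypothetical helpers, and the sketch assumes the Fasta constructor accepts an `index_file` path argument; adjust the keyword if your installed version differs.

```python
import os
import tempfile


def writable_index_path(fasta_path, fallback_dir=None):
    """Return an .fxi path next to the FASTA file if that directory is
    writable, otherwise a path inside fallback_dir (default: system temp dir)."""
    fasta_dir = os.path.dirname(os.path.abspath(fasta_path))
    index_name = os.path.basename(fasta_path) + ".fxi"
    if os.access(fasta_dir, os.W_OK):
        return os.path.join(fasta_dir, index_name)
    return os.path.join(fallback_dir or tempfile.gettempdir(), index_name)


def open_fasta(fasta_path, fallback_dir=None):
    """Open a FASTA file with an explicitly chosen index location."""
    # Imported lazily so the path helper can be used without pyfastx installed.
    import pyfastx
    # Assumption: index_file is the constructor keyword for the index path.
    return pyfastx.Fasta(
        fasta_path, index_file=writable_index_path(fasta_path, fallback_dir)
    )


# A directory that does not exist is not writable, so the fallback is used.
print(writable_index_path("/nonexistent-dir/genome.fa"))
```

Passing the path on every call keeps the index location independent of whatever default the installed version uses.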

Basic usage: open a FASTA or FASTQ file, iterate, access by ID.

import pyfastx

# Open a FASTA file (gzip compression is detected automatically)
fa = pyfastx.Fasta('example.fasta')

# Get sequence count and total length
print(len(fa))  # number of sequences
print(fa.size)  # total sequence length in bases

# Random access by sequence ID
seq = fa['chr1']
print(seq.seq[:10])  # first 10 bases

# Iterate over all sequences
for seq in fa:
    print(seq.name, len(seq))  # seq.name is the header ID; seq.id is the record's ordinal number

# For FASTQ:
fq = pyfastx.Fastq('example.fastq')
for read in fq:
    print(read.name, read.qual)  # read.name is the read's header ID
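The usage snippet above assumes that example.fasta and example.fastq already exist and that the FASTA file contains a record named chr1. A minimal sketch for generating small test inputs with the standard library only; the sequences and the `write_examples` helper are made up for illustration.

```python
import os
import tempfile

# Two FASTA records; the first header carries a description after the ID.
FASTA_TEXT = ">chr1 some description\nACGTACGTACGTACGT\n>chr2\nGGCCGGCCGGCC\n"
# Two FASTQ records; each quality string matches its sequence length.
FASTQ_TEXT = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCCGGCC\n+\nIIIIIIII\n"


def write_examples(directory):
    """Write example.fasta and example.fastq into directory; return both paths."""
    fasta_path = os.path.join(directory, "example.fasta")
    fastq_path = os.path.join(directory, "example.fastq")
    with open(fasta_path, "w") as handle:
        handle.write(FASTA_TEXT)
    with open(fastq_path, "w") as handle:
        handle.write(FASTQ_TEXT)
    return fasta_path, fastq_path


# Use a fresh temporary directory so nothing in the working tree is overwritten.
workdir = tempfile.mkdtemp()
fasta_path, fastq_path = write_examples(workdir)
print(fasta_path)
print(fastq_path)
```

Pointing pyfastx.Fasta and pyfastx.Fastq at the two returned paths lets the snippet run end to end on known input.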