fa3-fwd

0.0.3, verified Mon Apr 27

fa3-fwd provides a forward-only implementation of FlashAttention-3 for efficient attention computation on GPUs. Version 0.0.3 is a pre-release with no stable release cadence.

pip install fa3-fwd
error ModuleNotFoundError: No module named 'fa3_fwd'
cause Wrong import path, missing install, or Python environment issue.
fix Run 'pip install fa3-fwd' and use 'import fa3_fwd' (underscore, not hyphen).
error RuntimeError: FlashAttention only supported on CUDA
cause Tensors are on CPU instead of GPU.
fix Move tensors to the GPU before calling: q = q.cuda(), k = k.cuda(), v = v.cuda(), or create them with device='cuda'.
error TypeError: flash_attn_forward() missing 3 required positional arguments: 'q', 'k', 'v'
cause Fewer than three arguments were passed; q, k, and v are all required.
fix Call with three tensors: flash_attn_forward(q, k, v).
breaking Only forward pass is implemented; no backward pass. Cannot be used for training.
fix Use the full FlashAttention-3 library if a backward pass is needed.
deprecated The API is experimental and may change without notice in future versions.
fix Pin the version (e.g. fa3-fwd==0.0.3) if stability is required.
gotcha Requires a CUDA-capable GPU and a CUDA-enabled PyTorch build; raises RuntimeError on CPU tensors.
fix Ensure all tensors are on a CUDA device.
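The CPU failure mode above can be caught early. A minimal sketch, where ensure_cuda is a hypothetical helper (not part of fa3-fwd):

```python
import torch

def ensure_cuda(*tensors):
    # Hypothetical helper, not part of fa3-fwd: fail fast with the same
    # RuntimeError message the kernel would raise, before any work is done.
    if not torch.cuda.is_available():
        raise RuntimeError("FlashAttention only supported on CUDA")
    # Move any CPU tensors over; tensors already on the GPU are unchanged.
    return tuple(t.cuda() for t in tensors)
```

Call q, k, v = ensure_cuda(q, k, v) before invoking flash_attn_forward.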

Basic usage of the flash-attention forward pass:

import torch
from fa3_fwd import flash_attn_forward

# Query, key, and value tensors must live on the GPU; bfloat16 keeps the kernel fast.
q = torch.randn(1, 8, 64, 128, device='cuda', dtype=torch.bfloat16)
k = torch.randn(1, 8, 64, 128, device='cuda', dtype=torch.bfloat16)
v = torch.randn(1, 8, 64, 128, device='cuda', dtype=torch.bfloat16)

out = flash_attn_forward(q, k, v)
print(out.shape)  # same shape as q here, since q, k, v all match
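Since fa3-fwd has no CPU path, a plain-PyTorch reference can sanity-check results on small inputs. This is standard scaled dot-product attention, not the fused kernel, and the (batch, heads, seqlen, head_dim) layout is an assumption:

```python
import torch

def reference_attn_forward(q, k, v):
    # Plain scaled dot-product attention: numerically what an attention
    # forward pass computes, but materializing the full score matrix.
    # Assumes a (batch, heads, seqlen, head_dim) layout.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```

When comparing against flash_attn_forward output, use a loose tolerance: bf16 and fused-kernel accumulation order both perturb the result slightly.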