Mooncake Transfer Engine
Mooncake Transfer Engine is a Python binding (using pybind11) for the core data transfer component of the Mooncake project. Mooncake itself is a KVCache-centric disaggregated architecture designed to optimize Large Language Model (LLM) inference. The Transfer Engine provides a high-performance, unified interface for batched data movement across various storage devices and network links, supporting protocols like TCP, RDMA, CXL/shared-memory, and NVMe over Fabric. It is actively maintained with frequent updates and integrations into LLM serving frameworks like SGLang and vLLM.
Common errors
- No matched device found
  - cause: Incorrect network interface card (NIC) names in the configuration (e.g., `nic_priority_matrix`), or no active RDMA devices detected on the machine.
  - fix: Verify NIC names using `ibv_devinfo` and ensure they exist and are correctly configured. Confirm RDMA devices are active and properly initialized.
- Failed to create QP: Cannot allocate memory
  - cause: Too many Queue Pairs (QPs) have been created, hitting the driver's limit. Resource leaks from applications that crash or are killed without releasing RDMA resources can make this worse.
  - fix: Update `mooncake-transfer-engine` to version `0.3.5` or later. Set the environment variable `MC_ENABLE_DEST_DEVICE_AFFINITY=1` before starting the application to optimize QP allocation.
- tcp transfer engine does not support transferring GPU memory
  - cause: Attempting to transfer GPU memory over TCP when the Mooncake Transfer Engine was not built with CUDA support enabled, or when the underlying TCP transport does not support direct GPU memory access.
  - fix: Ensure Mooncake is built with `USE_CUDA=ON` even if you plan to use TCP for GPU memory transfers. For optimal GPU memory transfer, use the RDMA protocol (`protocol='rdma'`) and ensure GPUDirect RDMA is configured.
- Worker: Process failed for slice
  - cause: Indicates a problem during an RDMA transfer, often due to network instability, configuration errors in `rdma_transport/rdma_*.cpp`, or the RDMA driver moving the connection into an unavailable state.
  - fix: Troubleshoot network stability and RDMA device status. Review the `MC_TRANSFER_TIMEOUT` environment variable. The Transfer Engine attempts path reselection, but persistent failures require deeper network diagnostics. Examine accompanying error messages for specific clues.
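Several of the fixes above boil down to setting environment variables before the engine starts, and checking that RDMA devices are actually visible. The sketch below illustrates both; the timeout value is illustrative, the exact semantics of these variables depend on your `mooncake-transfer-engine` version, and `list_rdma_devices` is a hypothetical helper that reads the standard Linux sysfs path.

```python
import os

# Must be set before the TransferEngine is created (version-dependent semantics).
os.environ["MC_ENABLE_DEST_DEVICE_AFFINITY"] = "1"  # mitigate QP exhaustion (>= 0.3.5)
os.environ["MC_TRANSFER_TIMEOUT"] = "30"            # illustrative value, not a documented default

def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of RDMA devices visible to the kernel, or [] (Linux only)."""
    try:
        return sorted(os.listdir(sysfs_root))
    except FileNotFoundError:
        return []

devices = list_rdma_devices()
if not devices:
    print("No RDMA devices found; protocol='rdma' would fail with 'No matched device found'.")
else:
    print(f"RDMA devices: {devices}")
```

Cross-check the names this reports against `ibv_devinfo` and your `nic_priority_matrix` entries.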
Warnings
- gotcha The `mooncake-transfer-engine` package for CUDA-enabled systems requires CUDA 12.1+ during installation and runtime. For environments without CUDA, use `mooncake-transfer-engine-non-cuda`.
- gotcha When using RDMA protocol, proper kernel modules (like `nvidia_peermem` for NVIDIA GPUs) and permissions (often requiring `sudo`) are necessary. Issues with `nvidia_peermem` can cause RDMA failures.
- breaking Maintaining strict version consistency of the Transfer Engine between Mooncake itself and integrated inference engines (e.g., SGLang Serving Backend) is crucial for KVCache transport protocol compatibility. Incompatible versions can lead to transfer failures.
- gotcha Batch transfer APIs, particularly in multi-node NVLink transfers, have occasionally been observed to affect accuracy in some inference engines and benchmarks.
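The `nvidia_peermem` gotcha above can be pre-checked from Python by scanning the loaded kernel modules. This is a minimal sketch assuming a Linux host; `kernel_module_loaded` is a hypothetical helper, not part of the Mooncake API.

```python
def kernel_module_loaded(name, proc_modules="/proc/modules"):
    """Return True if the named kernel module appears in /proc/modules (Linux only)."""
    try:
        with open(proc_modules) as f:
            return any(line.split()[0] == name for line in f)
    except FileNotFoundError:
        return False

if not kernel_module_loaded("nvidia_peermem"):
    print("nvidia_peermem is not loaded; GPUDirect RDMA transfers may fail.")
    print("Try: sudo modprobe nvidia_peermem")
```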
Install
- `pip install mooncake-transfer-engine` (CUDA-enabled systems)
- `pip install mooncake-transfer-engine-non-cuda` (environments without CUDA)
Imports
- TransferEngine: `from mooncake.engine import TransferEngine`
- TransferNotify: `from mooncake.engine import TransferNotify`
- TransferOpcode: `from mooncake.engine import TransferOpcode`
Quickstart
```python
import numpy as np
import os

# In a real distributed setup, a metadata server (e.g., etcd) would be used.
# For a simple local demo, 'P2PHANDSHAKE' can be used.
METADATA_SERVER = os.environ.get('MC_METADATA_SERVER', 'P2PHANDSHAKE')
LOCAL_HOSTNAME = os.environ.get('MC_LOCAL_HOSTNAME', '127.0.0.1:12345')
PROTOCOL = os.environ.get('MC_PROTOCOL', 'tcp')  # Use 'rdma' for RDMA-capable networks
DEVICE_NAME = os.environ.get('MC_DEVICE_NAME', '')  # Auto discovery if empty

try:
    from mooncake.engine import TransferEngine

    # Create transfer engine instance
    engine = TransferEngine()

    # Initialize with basic configuration.
    # In a real scenario, local_hostname would be the actual server IP/port
    # and metadata_server would point to the etcd cluster or similar.
    engine.initialize(
        LOCAL_HOSTNAME,
        METADATA_SERVER,
        PROTOCOL,
        DEVICE_NAME,
    )

    # Allocate and initialize a buffer (e.g., 1 MB).
    # Note: for GPU memory, specific allocation methods/context would be needed.
    client_buffer = np.zeros(1024 * 1024, dtype=np.uint8)
    buffer_address = client_buffer.ctypes.data
    buffer_length = client_buffer.nbytes

    print(f"TransferEngine initialized on {LOCAL_HOSTNAME} with {PROTOCOL} protocol.")
    print(f"Allocated buffer at address: {buffer_address}, length: {buffer_length} bytes.")

    # Example: register memory (optional, depending on protocol/usage)
    # engine.register_memory(buffer_address, buffer_length)

    # In a full setup, you would then perform transfer operations, e.g.:
    # engine.transfer_sync_write(target_hostname, buffer_address, peer_buffer_address, buffer_length)
    print("Mooncake Transfer Engine basic setup successful (no actual transfer performed).")
except ImportError:
    print("mooncake-transfer-engine not installed or could not be imported.")
    print("Please ensure you installed the correct version for your CUDA environment.")
except Exception as e:
    print(f"An error occurred: {e}")
```
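Extending the quickstart, the sketch below shows what an actual one-sided write between two nodes could look like, using the `register_memory` and `transfer_sync_write` calls mentioned in the comments above. The hostnames, the `MC_TARGET_HOSTNAME` / `MC_PEER_BUFFER_ADDRESS` variables, and the idea that the peer's registered buffer address is exchanged out of band are all assumptions for illustration; a real deployment would need a running peer engine.

```python
import os
import numpy as np

# Assumed coordination: the peer shares its registered buffer address out of band
# (e.g., via your own control channel). Placeholder values below.
TARGET_HOSTNAME = os.environ.get("MC_TARGET_HOSTNAME", "192.168.0.2:12345")
PEER_BUFFER_ADDRESS = int(os.environ.get("MC_PEER_BUFFER_ADDRESS", "0"))

def make_buffer(size=1024 * 1024):
    """Allocate zeroed host memory and return (array, raw address, length)."""
    buf = np.zeros(size, dtype=np.uint8)
    return buf, buf.ctypes.data, buf.nbytes

try:
    from mooncake.engine import TransferEngine

    engine = TransferEngine()
    engine.initialize("192.168.0.1:12345", "P2PHANDSHAKE", "tcp", "")

    buf, addr, length = make_buffer()
    engine.register_memory(addr, length)  # pin the buffer for transfers

    # One-sided write: copy our buffer into the peer's registered region.
    engine.transfer_sync_write(TARGET_HOSTNAME, addr, PEER_BUFFER_ADDRESS, length)
except ImportError:
    print("mooncake-transfer-engine not installed; skipping the transfer sketch.")
```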