{"id":7429,"library":"mooncake-transfer-engine","title":"Mooncake Transfer Engine","description":"Mooncake Transfer Engine is a Python binding (built with pybind11) for the core data transfer component of the Mooncake project. Mooncake itself is a KVCache-centric disaggregated architecture designed to optimize Large Language Model (LLM) inference. The Transfer Engine provides a high-performance, unified interface for batched data movement across various storage devices and network links, supporting protocols such as TCP, RDMA, CXL/shared-memory, and NVMe over Fabric. It is actively maintained, with frequent updates and integrations into LLM serving frameworks such as SGLang and vLLM.","status":"active","version":"0.3.10.post1","language":"en","source_language":"en","source_url":"https://github.com/kvcache-ai/Mooncake","tags":["AI/ML","LLM","distributed systems","high-performance computing","RDMA","data transfer","GPU","kv-cache","inference acceleration"],"install":[{"cmd":"pip install mooncake-transfer-engine","lang":"bash","label":"For CUDA-enabled systems (requires CUDA 12.1+)"},{"cmd":"pip install mooncake-transfer-engine-non-cuda","lang":"bash","label":"For non-CUDA systems"}],"dependencies":[{"reason":"Used for buffer allocation in the Python examples.","package":"numpy","optional":false},{"reason":"Used in official examples for inter-process communication (installed as pyzmq, imported as zmq).","package":"pyzmq","optional":true},{"reason":"Commonly deployed as an external metadata server for cluster coordination (a service, not a Python import).","package":"etcd","optional":true}],"imports":[{"symbol":"TransferEngine","correct":"from mooncake.engine import TransferEngine"},{"symbol":"TransferNotify","correct":"from mooncake.engine import TransferNotify"},{"symbol":"TransferOpcode","correct":"from mooncake.engine import TransferOpcode"}],"quickstart":{"code":"import numpy as np\nimport os\n\n# In a real distributed setup, a metadata server (e.g., etcd) would be used.\n# For a simple local demo, 'P2PHANDSHAKE' can be used.\nMETADATA_SERVER = os.environ.get('MC_METADATA_SERVER', 'P2PHANDSHAKE')\nLOCAL_HOSTNAME = os.environ.get('MC_LOCAL_HOSTNAME', '127.0.0.1:12345')\nPROTOCOL = os.environ.get('MC_PROTOCOL', 'tcp')  # Use 'rdma' for RDMA-capable networks\nDEVICE_NAME = os.environ.get('MC_DEVICE_NAME', '')  # Auto-discovery if empty\n\ntry:\n    from mooncake.engine import TransferEngine\n\n    # Create a transfer engine instance\n    engine = TransferEngine()\n\n    # Initialize with basic configuration.\n    # In a real scenario, local_hostname would be the actual server IP/port\n    # and metadata_server would point to the etcd cluster or similar.\n    engine.initialize(\n        LOCAL_HOSTNAME,\n        METADATA_SERVER,\n        PROTOCOL,\n        DEVICE_NAME\n    )\n\n    # Allocate and zero-initialize a 1 MB buffer.\n    # Note: GPU memory requires specific allocation methods/context.\n    client_buffer = np.zeros(1024 * 1024, dtype=np.uint8)\n    buffer_address = client_buffer.ctypes.data\n    buffer_length = client_buffer.nbytes\n\n    print(f\"TransferEngine initialized on {LOCAL_HOSTNAME} with {PROTOCOL} protocol.\")\n    print(f\"Allocated buffer at address: {buffer_address}, length: {buffer_length} bytes.\")\n\n    # Example: register memory (needed for some protocols, e.g. RDMA):\n    # engine.register_memory(buffer_address, buffer_length)\n\n    # In a full setup, you would then perform transfer operations, e.g.:\n    # engine.transfer_sync_write(target_hostname, buffer_address, peer_buffer_address, buffer_length)\n\n    print(\"Mooncake Transfer Engine basic setup successful (no actual transfer performed).\")\n\nexcept ImportError:\n    print(\"mooncake-transfer-engine is not installed or could not be imported.\")\n    print(\"Please ensure you installed the correct variant for your CUDA environment.\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")","lang":"python","description":"This quickstart demonstrates basic initialization of the Mooncake Transfer Engine. It creates a `TransferEngine` instance, initializes it with network configuration, and allocates a NumPy buffer. In a real distributed setting, you would typically run a receiver and a sender, with `METADATA_SERVER` pointing to an actual coordination service such as etcd. Set `protocol` to 'rdma' on RDMA-capable networks for high performance."},"warnings":[{"fix":"Ensure your CUDA toolkit version is 12.1 or newer. Otherwise, build from source against your CUDA version or install `mooncake-transfer-engine-non-cuda`.","message":"The `mooncake-transfer-engine` package for CUDA-enabled systems requires CUDA 12.1+ at installation and runtime. For environments without CUDA, use `mooncake-transfer-engine-non-cuda`.","severity":"gotcha","affected_versions":">=0.3.0"},{"fix":"Ensure RDMA drivers and `nvidia_peermem` are correctly installed and loaded. Run applications with `sudo` if permission errors persist. Consider NIXL as an alternative to `nvidia_peermem` if issues arise.","message":"The RDMA protocol requires proper kernel modules (such as `nvidia_peermem` for NVIDIA GPUs) and permissions (often `sudo`). Problems with `nvidia_peermem` can cause RDMA failures.","severity":"gotcha","affected_versions":"All"},{"fix":"When upgrading an inference engine that integrates Mooncake, update the `mooncake-transfer-engine` package at the same time to maintain compatibility.","message":"Strict version consistency of the Transfer Engine between Mooncake itself and integrated inference engines (e.g., the SGLang serving backend) is crucial for KVCache transport protocol compatibility. Incompatible versions can lead to transfer failures.","severity":"breaking","affected_versions":"All"},{"fix":"Monitor accuracy closely when deploying applications that use batch transfer APIs with `mooncake-transfer-engine` in multi-node NVLink configurations.","message":"Batch transfer APIs have been observed to affect accuracy in some inference engines and benchmarks, particularly for multi-node NVLink transfers.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Verify NIC names with `ibv_devinfo` and ensure they exist and are correctly configured. Confirm RDMA devices are active and properly initialized.","cause":"Incorrect network interface card (NIC) names in the configuration (e.g., `nic_priority_matrix`), or no active RDMA devices detected on the machine.","error":"No matched device found"},{"fix":"Update `mooncake-transfer-engine` to version `0.3.5` or later. Set the environment variable `MC_ENABLE_DEST_DEVICE_AFFINITY=1` before starting the application to optimize QP allocation.","cause":"Too many Queue Pairs (QPs) have been created, hitting the driver's limit. This can be exacerbated by resource leaks from applications that crash or are killed without releasing RDMA resources.","error":"Failed to create QP: Cannot allocate memory"},{"fix":"Build Mooncake with `USE_CUDA=ON` even if you plan to use TCP for GPU memory transfers. For optimal GPU memory transfer, use the RDMA protocol (`protocol='rdma'`) and ensure GPUDirect RDMA is configured.","cause":"Attempting to transfer GPU memory over TCP when the Mooncake Transfer Engine was built without CUDA support, or when the underlying TCP transport does not support direct GPU memory access.","error":"tcp transfer engine does not support transferring GPU memory"},{"fix":"Troubleshoot network stability and RDMA device status. Review the `MC_TRANSFER_TIMEOUT` environment variable. The Transfer Engine attempts path reselection, but persistent issues require deeper network diagnostics. Examine accompanying error messages for specific clues.","cause":"Indicates a failure during an RDMA transfer, often due to network instability, configuration errors in `rdma_transport/rdma_*.cpp`, or the RDMA driver setting the connection to an unavailable state.","error":"Worker: Process failed for slice"}]}