lmcache
lmcache is a Python library that extends an LLM serving engine. It aims to reduce Time To First Token (TTFT) and increase throughput, particularly for long-context workloads. The current version is 0.4.3, and development is active.
Common errors
- ConnectionRefusedError: [Errno 111] Connection refused
cause The lmcache server is not running, or the client is connecting to the wrong host/port.
fix Start the lmcache server (e.g., `lmcache serve`) and ensure the client's `host` and `port` parameters match the server's configuration.
- AttributeError: 'Client' object has no attribute 'complete'
cause You are calling an older API method (e.g., `complete`) with an lmcache client of version 0.4.0 or newer.
fix Update your client code to the new OpenAI-compatible API, specifically `client.chat_completion()` and the related schema objects.
- ModuleNotFoundError: No module named 'lmcache.core.client'
cause You are importing the `Client` class from its pre-0.4.0 module path.
fix Change the import from `from lmcache.core.client import Client` to `from lmcache.client import Client`.
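The last two errors both stem from the 0.4.0 refactor. If your code must run against both old and new client versions, a defensive sketch like the following can paper over the change. The legacy names (`lmcache.core.client`, `complete`) are taken from the errors above and should be treated as assumptions; `load_client_class` and `send_chat` are hypothetical helper names, not part of lmcache.

```python
import importlib


def load_client_class(paths=("lmcache.client", "lmcache.core.client"),
                      name="Client"):
    """Return the named class from the first importable module path.

    Tries the current (0.4.0+) path first, then the assumed legacy path.
    """
    last_error = None
    for path in paths:
        try:
            return getattr(importlib.import_module(path), name)
        except (ImportError, AttributeError) as exc:
            last_error = exc
    raise ImportError(f"could not locate {name}: {last_error}")


def send_chat(client, request):
    """Dispatch to whichever completion method this client version exposes."""
    if hasattr(client, "chat_completion"):  # 0.4.0+ OpenAI-style API
        return client.chat_completion(request)
    if hasattr(client, "complete"):  # assumed pre-0.4.0 method name
        return client.complete(request)
    raise AttributeError("client exposes neither chat_completion nor complete")
```

Pinning the dependency (`lmcache>=0.4,<0.5`) is the simpler alternative if you control the environment.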
Warnings
- breaking The client-side API underwent a significant refactor in version 0.4.0 to align more closely with the OpenAI API. Code written for versions prior to 0.4.0 will likely be incompatible.
- gotcha lmcache uses a client-server architecture: the client library cannot function without a separate lmcache server instance running. If the server is not started or is unreachable, the client fails with a 'Connection refused' error.
- gotcha The lmcache server (and thus, the library's utility) often requires significant GPU memory and computational resources, especially for large language models. Insufficient resources can lead to performance issues or failures.
- gotcha Model compatibility and configuration can be tricky. The client's `model` parameter must correspond to a model successfully loaded and served by the lmcache server, which might require specific server configurations or local model files.
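Because of the client-server gotcha above, it can help to probe the server before constructing a client, so a missing server surfaces as a clear message instead of a mid-request ConnectionRefusedError. This is a plain stdlib TCP check, independent of lmcache; `wait_for_server` is a hypothetical helper name, and port 13333 is the default mentioned in the Quickstart below.

```python
import socket
import time


def wait_for_server(host="localhost", port=13333, retries=5, delay=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Polls up to `retries` times with `delay` seconds between attempts,
    which also covers the case where the server is still starting up.
    """
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

Call it before `Client(...)` and fail fast with an actionable message ("start the lmcache server") when it returns False.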
Install
- pip install lmcache
Imports
- Client
from lmcache.client import Client
- ChatCompletionRequest
from lmcache.schemas import ChatCompletionRequest
- ChatCompletionMessage
from lmcache.schemas import ChatCompletionMessage
Quickstart
import os

from lmcache.client import Client
from lmcache.schemas import ChatCompletionRequest, ChatCompletionMessage

# NOTE: An lmcache server must be running separately for this client to connect.
# Default server host is 'localhost', port 13333.
try:
    client = Client(
        host=os.environ.get('LMCACHE_HOST', 'localhost'),
        port=int(os.environ.get('LMCACHE_PORT', 13333)),
    )
    request = ChatCompletionRequest(
        # Replace with a model supported by your lmcache server
        model=os.environ.get('LMCACHE_MODEL', 'gpt-3.5-turbo'),
        messages=[
            ChatCompletionMessage(role="user", content="Hello, how are you?"),
            ChatCompletionMessage(role="assistant", content="I am doing well, thank you!"),
            ChatCompletionMessage(role="user", content="What is your purpose?"),
        ],
    )
    response = client.chat_completion(request)
    print(f"Assistant: {response.choices[0].message.content}")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure the lmcache server is running and accessible at the specified host and port.")