lmcache

0.4.3 · active · verified Thu Apr 16

lmcache is a Python library that extends LLM serving engines. It aims to reduce Time To First Token (TTFT) and increase throughput, particularly for workloads with long contexts. The current version is 0.4.3, and development is active.

Common errors

Warnings

Install
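Assuming the package is published on PyPI under the name `lmcache`, it can be installed with pip; pinning the version shown in the header is optional:

```shell
pip install "lmcache==0.4.3"
```
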

Imports

Quickstart

This quickstart shows how a client might send an OpenAI-style chat completion request to a running lmcache server. Note that the client-side API sketched below (`Client`, `ChatCompletionRequest`, `ChatCompletionMessage`) reflects this card's assumptions about the library's interface; verify the names against the installed version. An lmcache server must be running independently before executing this client code.

import os

from lmcache.client import Client
from lmcache.schemas import ChatCompletionMessage, ChatCompletionRequest

# NOTE: An lmcache server must be running separately for this client to
# connect. By default it is assumed to listen on localhost:13333; override
# via the LMCACHE_HOST and LMCACHE_PORT environment variables.

try:
    client = Client(
        host=os.environ.get("LMCACHE_HOST", "localhost"),
        port=int(os.environ.get("LMCACHE_PORT", "13333")),
    )

    # Build an OpenAI-style chat request. Replace the model name with one
    # that your lmcache server actually serves.
    request = ChatCompletionRequest(
        model=os.environ.get("LMCACHE_MODEL", "gpt-3.5-turbo"),
        messages=[
            ChatCompletionMessage(role="user", content="Hello, how are you?"),
            ChatCompletionMessage(role="assistant", content="I am doing well, thank you!"),
            ChatCompletionMessage(role="user", content="What is your purpose?"),
        ],
    )

    response = client.chat_completion(request)
    print(f"Assistant: {response.choices[0].message.content}")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure the lmcache server is running and accessible at the specified host and port.")
