This project forks from cyllama and provides a Python wrapper for @ggerganov's llama.cpp which is likely the most active open-source compiled LLM inference engine.
The following table provide an overview of the current implementations / features:
| implementations / features | xllamacpp | llama-cpp-python |
|---|---|---|
| Wrapper-type | cython | ctypes |
| API | Server & Params API | Llama API |
| Server implementation | C++ | Python through wrapped LLama API |
| Continuous batching | yes | no |
| Thread safe | yes | no |
| Release package | prebuilt | build during installation |
It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!
As the intent is to provide a very thin wrapping layer and play to the strengths of the original c++ library as well as python, the approach to wrapping intentionally adopts the following guidelines:
-
In general, key structs are implemented as cython extension classses with related functions implemented as methods of said classes.
-
Be as consistent as possible with llama.cpp's naming of its api elements, except when it makes sense to shorten functions names which are used as methods.
-
Minimize non-wrapper python code.
Note on Performance and Compatibility
⚠️ The pre-built Linux/Windows CPU-only release wheels are NOT fully optimized. They are compiled for maximum compatibility across a wide range of hardware, not for maximum performance. CPU-native optimizations (e.g., AVX-512, SVE, AMX) are disabled in these builds to ensure they run on as many machines as possible.✅ macOS wheels are fully optimized. Since each macOS architecture (Apple Silicon / Intel) uses a consistent CPU, native optimizations are enabled in release builds.
✅ GPU-accelerated wheels (CUDA / ROCm / Vulkan) are already optimized for GPU computation. The heavy lifting is done on the GPU, so the lack of CPU-native optimizations has minimal impact on overall performance.
If you are on Linux/Windows using CPU-only inference and want the best performance, build from source. When building locally, all native CPU optimizations are automatically enabled for your specific machine. See Build from Source for instructions.
Specifically:
- macOS (Apple Silicon & Intel): wheels are built with native CPU optimizations enabled — no action needed.
- CUDA / ROCm / Vulkan wheels: GPU acceleration is fully optimized; CPU-native optimizations are not critical since inference runs primarily on the GPU.
- Linux x86_64 CPU-only wheels disable
-march=native, so advanced instruction sets like AVX-512 or AMX are not used. - Linux aarch64 CPU-only wheels target the baseline
armv8-aarchitecture, so advanced features like SVE are not enabled. - Windows x86_64 CPU-only wheels disable
-march=native, similar to Linux x86_64.
If you are on Linux/Windows using CPU-only inference and your CPU supports these advanced features, building from source can provide significantly better performance.
- From pypi for
CPUorMac:
pip install -U xllamacpp-
From github pypi for
CUDA(use--force-reinstallto replace the installed CPU version):-
CUDA 12.4
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
-
CUDA 12.8
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu128
-
-
From github pypi for
HIPAMD GPU (use--force-reinstallto replace the installed CPU version):-
ROCm 6.3.4
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.3.4
-
ROCm 6.4.1
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.4.1
-
-
From github pypi for
Vulkan(use--force-reinstallto replace the installed CPU version):pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/vulkan
Before pip installing xllamacpp, please ensure your system meets the following requirements based on your build type:
-
CPU (macOS):
- Pre-built wheels are already optimized for native CPU — no additional steps needed
-
CPU (Linux aarch64):
- Requires ARMv8-A or later architecture
- For best performance, build from source if your CPU supports advanced instruction sets (e.g., SVE)
-
CUDA (Linux):
- Requires glibc 2.35 or later
- Compatible NVIDIA GPU with appropriate drivers (CUDA 12.4 or 12.8)
-
ROCm (Linux):
- Requires glibc 2.35 or later
- Requires gcc 10 or later (ROCm libraries have this dependency)
- Compatible AMD GPU with ROCm support (ROCm 6.3.4 or 6.4.1)
-
Vulkan (Linux/Windows, Intel/AMD/NVIDIA where supported):
- Install the Vulkan SDK and GPU drivers with Vulkan support
- Linux users may need distro packages and the LunarG SDK
- macOS Intel is supported via Vulkan; Apple Silicon Vulkan is not supported in this project
-
CUDA
This provides GPU acceleration using an NVIDIA GPU. Make sure to have the CUDA toolkit installed.
You may find the official downloads here: NVIDIA developer site.
We also have a guide for setting up CUDA toolkit in a Fedora toolbox container.
Recommended for:
- Necessary for users of Atomic Desktops for Fedora; such as: Silverblue and Kinoite.
- (there are no supported CUDA packages for these systems)
- Necessary for users that have a host that is not a: Supported Nvidia CUDA Release Platform.
- (for example, you may have Fedora 42 Beta as your your host operating system)
- Convenient For those running Fedora Workstation or Fedora KDE Plasma Desktop, and want to keep their host system clean.
- Optionally toolbox packages are available: Arch Linux, Red Hat Enterprise Linux >= 8.5, or Ubuntu
- Necessary for users of Atomic Desktops for Fedora; such as: Silverblue and Kinoite.
-
HIP
This provides GPU acceleration on HIP-supported AMD GPUs. Make sure to have ROCm installed. You can download it from your Linux distro's package manager or from here: ROCm Quick Start (Linux).
Or you can try to build inside the ROCm docker container.
-
Vulkan
Install the Vulkan SDK and drivers for your platform.
- Linux: use your distro packages and/or the LunarG Vulkan SDK.
- Windows: install LunarG Vulkan SDK and vendor GPU drivers.
- macOS: Intel only; Apple Silicon is not supported for Vulkan in this project.
-
A recent version of
python3(testing on python 3.12) -
Install Rust toolchain (required for building):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | shFor more installation options, see the rustup installation guide.
- Git clone the latest version of
xllamacpp:
git clone git@github.com:xorbitsai/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update- Install dependencies of
cython,setuptools, andpytestfor testing:
pip install -r requirements.txt-
Select backend via environment and build. Examples:
-
CPU (default):
make
-
CUDA:
export XLLAMACPP_BUILD_CUDA=1 make -
HIP (AMD):
export XLLAMACPP_BUILD_HIP=1 make -
Vulkan:
export XLLAMACPP_BUILD_VULKAN=1 make -
Enable BLAS (optional):
export CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" make
-
All examples below assume you have xllamacpp installed and a GGUF model file available. See Install for installation instructions and Testing for how to download models.
CommonParams is the central configuration object. It controls model loading, inference, sampling, server behavior, and more. For a complete list of all parameters with types, defaults, and descriptions, see the Parameters Reference.
import xllamacpp as xlc
params = xlc.CommonParams()
# Model path
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
# Context and prediction
params.n_ctx = 512 # Context window size
params.n_predict = 128 # Max tokens to predict
params.n_batch = 2048 # Batch size for prompt processing
params.n_ubatch = 512 # Micro-batch size
params.n_parallel = 1 # Number of parallel sequences
# CPU threading
params.cpuparams.n_threads = 4
params.cpuparams_batch.n_threads = 4
# Sampling parameters
params.sampling.temp = 0.8
params.sampling.top_k = 40
params.sampling.top_p = 0.95
params.sampling.seed = 42
# Server options
params.warmup = True
params.endpoint_metrics = TrueGenerate text from a prompt using the completions API:
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_predict = 128
params.n_ctx = 256
params.cpuparams.n_threads = 4
server = xlc.Server(params)
# Non-streaming completion (returns a dict)
result = server.handle_completions({
"max_tokens": 128,
"prompt": "Write the fibonacci function in C++.",
})
print(result["choices"][0]["text"])Use the chat completions API with message-based conversations:
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_predict = 128
params.n_ctx = 256
params.cpuparams.n_threads = 4
server = xlc.Server(params)
result = server.handle_chat_completions({
"max_tokens": 128,
"messages": [
{"role": "system", "content": "You are a coding assistant."},
{"role": "user", "content": "Write the fibonacci function in C++."},
],
})
print(result["choices"][0]["message"]["content"])Enable streaming to receive tokens as they are generated. Provide a callback function:
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_predict = 64
params.n_ctx = 256
params.cpuparams.n_threads = 4
server = xlc.Server(params)
# Streaming completions with a callback
server.handle_completions(
{
"max_tokens": 128,
"prompt": "Tell me a story.",
"stream": True,
},
lambda chunk: print(chunk), # Called for each chunk
)
# Streaming chat completions
server.handle_chat_completions(
{
"max_tokens": 128,
"messages": [{"role": "user", "content": "Tell me a joke."}],
"stream": True,
},
lambda chunk: print(chunk),
)You can stop streaming early by returning True from the callback:
count = 0
def stop_after_one(chunk):
global count
count += 1
print(chunk)
return True # Returning True stops the stream
server.handle_completions(
{"prompt": "Write a long story.", "stream": True, "max_tokens": 128},
stop_after_one,
)
# Only one chunk will be receivedConstrain model output to a JSON schema using grammar:
import json
import xllamacpp as xlc
# Define a JSON schema
schema = {
"type": "object",
"properties": {
"answer": {"type": "string"},
"score": {"type": "number"},
},
"required": ["answer"],
}
# Convert schema to grammar string
grammar = xlc.json_schema_to_grammar(schema)
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_predict = 64
params.n_ctx = 256
params.sampling.temp = 0
params.sampling.top_k = 1
params.sampling.grammar = grammar
server = xlc.Server(params)
result = server.handle_chat_completions({
"max_tokens": 64,
"messages": [
{"role": "system", "content": "Respond with a JSON object matching the provided schema."},
{"role": "user", "content": "Provide an answer string and an optional numeric score."},
],
})
content = result["choices"][0]["message"]["content"]
parsed = json.loads(content)
print(parsed) # e.g. {"answer": "Hello", "score": 42}Generate embeddings for a list of texts. Requires an embedding model (e.g., Qwen3-Embedding-0.6B-Q8_0.gguf):
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Qwen3-Embedding-0.6B-Q8_0.gguf"
params.embedding = True
params.n_ctx = 512
params.n_batch = 128
params.n_ubatch = 128
params.pooling_type = xlc.llama_pooling_type.LLAMA_POOLING_TYPE_LAST
server = xlc.Server(params)
result = server.handle_embeddings({
"input": [
"I believe the meaning of life is",
"This is a test",
],
"model": "My Qwen3 Model",
})
for item in result["data"]:
print(f"Index {item['index']}: {len(item['embedding'])} dimensions")Rerank documents by relevance to a query. Requires a reranker model (e.g., bge-reranker-v2-m3-Q2_K.gguf):
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/bge-reranker-v2-m3-Q2_K.gguf"
params.embedding = True
params.n_ctx = 512
params.n_batch = 128
params.n_ubatch = 128
params.pooling_type = xlc.llama_pooling_type.LLAMA_POOLING_TYPE_RANK
server = xlc.Server(params)
result = server.handle_rerank({
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"The Eiffel Tower is in Paris.",
"Germany is located in Europe.",
],
})
for doc in result["results"]:
print(f"Document {doc['index']}: relevance_score={doc['relevance_score']:.4f}")Process images alongside text using a vision model. Requires a multimodal model and its mmproj file:
import base64
import xllamacpp as xlc
# Load and encode an image
with open("image.png", "rb") as f:
img_b64 = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")
params = xlc.CommonParams()
params.model.path = "models/tinygemma3-Q8_0.gguf"
params.mmproj.path = "models/mmproj-tinygemma3.gguf"
params.n_predict = 128
params.n_ctx = 1024
params.sampling.temp = 0
params.sampling.top_k = 1
server = xlc.Server(params)
result = server.handle_chat_completions({
"max_tokens": 128,
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is this:"},
{"type": "image_url", "image_url": {"url": img_b64}},
],
},
],
})
print(result["choices"][0]["message"]["content"])Load and apply LoRA adapters to customize model behavior:
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/stories15M_MOE-F16.gguf"
params.n_predict = 64
params.n_ctx = 256
params.sampling.seed = 42
params.sampling.temp = 0.0
params.sampling.top_k = 1
# Attach a LoRA adapter with a scale factor
lora = xlc.CommonAdapterLoraInfo("models/moe_shakespeare15M.gguf", 1.0)
params.lora_adapters = [lora]
# You can adjust the scale dynamically
params.lora_adapters[0].scale = 0.5 # Half strength
server = xlc.Server(params)
result = server.handle_completions({
"max_tokens": 64,
"prompt": "Look in thy glass",
"temperature": 0.0,
})
print(result["choices"][0]["text"])Control reasoning/thinking behavior for models that support it (e.g., DeepSeek, QwQ):
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/my-reasoning-model.gguf"
# Reasoning format: controls how thought tags are parsed
# COMMON_REASONING_FORMAT_DEEPSEEK (default) - puts thoughts in reasoning_content
# COMMON_REASONING_FORMAT_NONE - leaves thoughts unparsed in content
params.reasoning_format = xlc.common_reasoning_format.COMMON_REASONING_FORMAT_DEEPSEEK
# Enable/disable reasoning: -1=auto, 0=off, 1=on
params.enable_reasoning = 1
# Token budget for thinking: -1=unrestricted, 0=immediate end, N>0=budget
params.reasoning_budget = 1024
# Sampling-level reasoning budget
params.sampling.reasoning_budget_tokens = 500Speculative decoding uses a smaller draft model to speed up generation. Configure it via params.speculative:
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_predict = 128
params.n_ctx = 512
# Draft model for speculative decoding
params.speculative.type = xlc.common_speculative_type.COMMON_SPECULATIVE_TYPE_DRAFT
params.speculative.mparams_dft.path = "models/small-draft-model.gguf"
params.speculative.n_max = 16 # Max tokens to draft
params.speculative.n_min = 5 # Min tokens to draft
params.speculative.p_min = 0.75 # Min probability threshold
# Or use n-gram based speculative decoding (no draft model needed)
params.speculative.type = xlc.common_speculative_type.COMMON_SPECULATIVE_TYPE_LOOKUP
params.speculative.lookup_cache_static = "cache.bin"
params.speculative.ngram_size_n = 12
params.speculative.ngram_size_m = 48
params.speculative.ngram_min_hits = 1Both handle_completions() and handle_chat_completions() accept all llama.cpp server options in the request dict. Common options include:
result = server.handle_chat_completions({
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 128,
"temperature": 0.7,
"top_k": 40,
"top_p": 0.95,
"min_p": 0.05,
"seed": 42,
"stream": False,
"stop": ["\n\n", "User:"], # Stop sequences
"cache_prompt": True, # Reuse KV cache from previous requests
"n_probs": 5, # Return top-N token probabilities
"repeat_penalty": 1.1,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"grammar": "...", # BNF-like grammar constraint
"json_schema": {"type": "object"}, # JSON schema constraint
"response_format": {"type": "json_object"}, # OpenAI-compatible format
"samplers": ["dry", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"],
"lora": [{"id": 0, "scale": 0.5}], # Per-request LoRA scaling
})Estimate how many layers can be offloaded to GPU(s) given available memory:
from xllamacpp import estimate_gpu_layers
devices = [
{"name": "cuda", "memory_free": 8 * 1024**3, "memory_min": 2048},
]
estimate = estimate_gpu_layers(
devices,
"models/my-model.gguf",
[], # projector paths
context_length=2048,
batch_size=512,
num_parallel=1,
kv_cache_type="",
)
print(f"Layers to offload: {estimate.layers}")
print(f"VRAM usage: {estimate.vram_size / 1024**3:.2f} GB")
print(f"Total model size: {estimate.total_size / 1024**3:.2f} GB")
print(f"Tensor split: {estimate.tensor_split}")Query the runtime environment, including CPU features and available compute devices:
import xllamacpp as xlc
# Get system info string (CPU features, build options, etc.)
print(xlc.get_system_info())
# Get available compute devices (CPU, CUDA, Metal, etc.)
devices = xlc.get_device_info()
for dev in devices:
print(dev["name"])xlc.Server automatically starts an HTTP server that provides OpenAI API compatible endpoints. The server supports continuous batching, parallel decoding, prompt caching, and multimodal inputs. For the full specification, see the llama.cpp server documentation.
The server exposes the following endpoints. For full details on request/response formats, see the llama.cpp server documentation.
General:
| Endpoint | Method | Description |
|---|---|---|
/health, /v1/health |
GET | Health check (public, no API key required) |
/models, /v1/models |
GET | List loaded models (OpenAI compatible) |
/api/tags |
GET | List models (Ollama compatible) |
/props |
GET/POST | Server properties & default generation settings |
/metrics |
GET | Prometheus-format metrics (requires endpoint_metrics = True) |
/slots |
GET | View inference slot states (requires endpoint_slots = True) |
Inference:
| Endpoint | Method | Description |
|---|---|---|
/completion |
POST | Text completions (llama.cpp native format) |
/v1/completions, /completions |
POST | Text completions (OpenAI compatible) |
/v1/chat/completions, /chat/completions |
POST | Chat completions (OpenAI compatible) |
/api/chat |
POST | Chat (Ollama compatible) |
/v1/responses |
POST | Responses API (OpenAI compatible) |
/v1/messages |
POST | Messages API (Anthropic compatible) |
/infill |
POST | Code infill (FIM: fill-in-the-middle) |
Embeddings & Reranking:
| Endpoint | Method | Description |
|---|---|---|
/v1/embeddings |
POST | Generate embeddings (OpenAI compatible) |
/embedding |
POST | Generate embeddings (llama.cpp native, supports pooling none) |
/rerank, /v1/rerank, /v1/reranking |
POST | Rerank documents by relevance |
Tokenization & Templates:
| Endpoint | Method | Description |
|---|---|---|
/tokenize |
POST | Tokenize text to token IDs |
/detokenize |
POST | Convert token IDs back to text |
/apply-template |
POST | Apply chat template without inference |
LoRA & Slot Management:
| Endpoint | Method | Description |
|---|---|---|
/lora-adapters |
GET | List loaded LoRA adapters with scales |
/lora-adapters |
POST | Update global LoRA adapter scales |
/slots/{id}?action=save |
POST | Save slot KV cache to file |
/slots/{id}?action=restore |
POST | Restore slot KV cache from file |
/slots/{id}?action=erase |
POST | Erase slot KV cache |
import xllamacpp as xlc
from openai import OpenAI
# Start server
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
server = xlc.Server(params)
# Connect using OpenAI client
client = OpenAI(
base_url=server.listening_address + "/v1",
api_key="not-required" # No API key needed for local server
)
# Chat completion
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "What is the capital of France?"}],
max_tokens=10
)
print(response.choices[0].message.content)import requests
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_ctx = 256
params.endpoint_metrics = True
server = xlc.Server(params)
base_url = server.listening_address
# Health check
resp = requests.get(f"{base_url}/health")
print(resp.json()) # {"status": "ok"}
# Chat completion
resp = requests.post(f"{base_url}/v1/chat/completions", json={
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 32,
})
print(resp.json()["choices"][0]["message"]["content"])
# Streaming
resp = requests.post(f"{base_url}/v1/chat/completions", json={
"messages": [{"role": "user", "content": "Tell me a joke."}],
"max_tokens": 64,
"stream": True,
}, stream=True)
for line in resp.iter_lines():
if line:
print(line.decode())
# Tokenize / Detokenize
resp = requests.post(f"{base_url}/tokenize", json={
"content": "Hello world, how are you?",
"add_special": False,
"with_pieces": True, # Return token pieces alongside IDs
})
print(resp.json()["tokens"])
resp = requests.post(f"{base_url}/detokenize", json={
"tokens": [1, 2, 3, 4, 5]
})
print(resp.json()["content"])
# Apply chat template (format messages without inference)
resp = requests.post(f"{base_url}/apply-template", json={
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"},
]
})
print(resp.json()["prompt"]) # Formatted prompt string
# Metrics (Prometheus format)
resp = requests.get(f"{base_url}/metrics")
print(resp.text)import requests
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Qwen3-Embedding-0.6B-Q8_0.gguf"
params.embedding = True
params.n_ctx = 512
params.n_batch = 128
params.n_ubatch = 128
params.pooling_type = xlc.llama_pooling_type.LLAMA_POOLING_TYPE_LAST
server = xlc.Server(params)
base_url = server.listening_address
resp = requests.post(f"{base_url}/v1/embeddings", json={
"input": ["I believe the meaning of life is", "This is a test"],
})
data = resp.json()
for item in data["data"]:
print(f"Index {item['index']}: {len(item['embedding'])} dimensions")import requests
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/bge-reranker-v2-m3-Q2_K.gguf"
params.embedding = True
params.reranking = True
params.n_ctx = 512
params.n_batch = 128
params.n_ubatch = 128
params.pooling_type = xlc.llama_pooling_type.LLAMA_POOLING_TYPE_RANK
server = xlc.Server(params)
base_url = server.listening_address
resp = requests.post(f"{base_url}/v1/rerank", json={
"model": "bge-reranker",
"query": "What is panda?",
"top_n": 3,
"documents": [
"hi",
"it is a bear",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
],
})
for doc in resp.json()["results"]:
print(f"Document {doc['index']}: relevance_score={doc['relevance_score']:.4f}")The /infill endpoint supports fill-in-the-middle (FIM) for code completion:
import requests
resp = requests.post(f"{base_url}/infill", json={
"input_prefix": "def fibonacci(n):\n if n <= 1:\n return n\n",
"input_suffix": "\n return fibonacci(n-1) + fibonacci(n-2)",
"max_tokens": 64,
})
print(resp.json()["content"])Manage LoRA adapters at runtime through HTTP endpoints:
import requests
# List loaded LoRA adapters
resp = requests.get(f"{base_url}/lora-adapters")
print(resp.json()) # [{"id": 0, "path": "...", "scale": 1.0}, ...]
# Update global LoRA scales
requests.post(f"{base_url}/lora-adapters", json=[
{"id": 0, "scale": 0.5},
])
# Per-request LoRA scaling (overrides global scale)
resp = requests.post(f"{base_url}/v1/chat/completions", json={
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 32,
"lora": [{"id": 0, "scale": 0.8}],
})The server supports multiple concurrent requests via parallel slots:
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.n_parallel = 4 # 4 concurrent request slots
params.n_ctx = 2048 # Context shared across slots
params.cont_batching = True # Enable continuous batching (default: True)
server = xlc.Server(params)Prompt caching is enabled by default and reuses KV cache from previous requests when prompts share a common prefix:
params = xlc.CommonParams()
params.cache_prompt = True # Enabled by default
params.n_cache_reuse = 256 # Min chunk size for KV shifting reuse (0 = disabled)You can also control caching per-request:
result = server.handle_chat_completions({
"messages": [{"role": "user", "content": "Hello!"}],
"cache_prompt": True, # Reuse KV cache from previous request
})The server supports automatic sleep after idle time to free resources. When sleeping, the model and KV cache are unloaded from memory. Any new inference request automatically wakes the server.
import xllamacpp as xlc
params = xlc.CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"
params.sleep_idle_seconds = 60 # Sleep after 60 seconds of inactivity
server = xlc.Server(params)
# Server will automatically sleep/wake as needed
# Check status via: GET /props → {"is_sleeping": true/false}Note: /health, /props, and /models endpoints remain responsive during sleep and do not trigger a wake-up.
The tests directory provides extensive examples and test coverage for xllamacpp.
As a first step, download a small GGUF model from HuggingFace. The default test model is Llama-3.2-1B-Instruct-Q8_0.gguf. Models should be placed in the models/ directory:
make downloadThis basically just does:
cd xllamacpp
mkdir models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf You can verify your installation with llama-cli:
bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
-p "Is mathematics discovered or invented?"Run the full test suite:
make test