Host an LLM on a non-blocking HTTP server with a simple interface for managing and querying models. This lets you host an LLM on LLM-capable hardware and then access it over the network from devices that can't run an LLM locally.
Built on FastAPI and uvicorn, it wraps the large-language-model library to expose LLM inference over a REST API.
A user-friendly GUI is included (but not required) for general use.
See Releases to install from wheel file.
See pyproject.toml for required Python version and dependencies.
This library uses optional dependencies for additional features. See the pyproject.toml for the list of optional library tags. To install with GUI support, use the gui tag.
Install this repo as a library into another project.
uv add "llm_server @ git+https://github.com/EricApgar/llm-server"
...with optional libraries:
uv add "llm_server[gui] @ git+https://github.com/EricApgar/llm-server"
pip install "llm_server @ git+https://github.com/EricApgar/llm-server"
...with optional libraries:
pip install "llm_server[gui] @ git+https://github.com/EricApgar/llm-server"
Run locally for development of this repo. Create a virtual environment and then install the dependencies into the environment.
uv sync
...with optional libraries:
uv sync --extra gui
pip install -e "."
...with optional libraries:
pip install -e ".[gui]"
Running an LLM requires an NVIDIA GPU, ideally with a large number of TOPS. See the large-language-model repo for hardware details and GPU driver setup.
Create a server, register a model with a tag, load it, and start serving. Then send a request and receive a response.
import llm_server
server = llm_server.Server()
server.set_host(ip_address='127.0.0.1', port=8000)
server.add_model(tag='gpt', name='openai/gpt-oss-20b')
server.load_model(tag='gpt', location=<path to model cache dir>)
server.start() # Non-blocking.
server.stop()
import requests

URL = 'http://127.0.0.1:8000/ask'  # Matches the host/port set via set_host().
details = {...}  # See request body examples below.
response = requests.post(URL, json=details, timeout=15)
data = response.json()
print(data['text'])
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check; returns "Running.". |
| GET | `/get-models` | List all available hosted models and their tags. |
| GET | `/ask-test` | Send a test prompt ("Tell me a joke.") to the first available model. |
| POST | `/ask` | Send a prompt to a specific model. |
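With the server running, the GET endpoints can be queried with plain HTTP. A minimal sketch using only the standard library (the base URL and the assumption that the endpoint returns JSON are mine, not guaranteed by the API):

```python
import json
from urllib.request import urlopen

def get_models(base_url: str) -> dict:
    """Query GET /get-models and parse the JSON response.

    base_url is something like "http://127.0.0.1:8000"; this assumes
    the server is already running.
    """
    with urlopen(f"{base_url}/get-models", timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The same pattern works for `/` and `/ask-test` by swapping the path.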
{
"tag": "gpt",
"prompt": "Tell me a joke.",
"max_tokens": 64,
"temperature": 0.9
}
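For convenience, the request body above can be built in Python before posting it (a small hypothetical helper; field names mirror the example, and the defaults are illustrative):

```python
def build_ask_payload(tag: str, prompt: str, max_tokens: int = 64,
                      temperature: float = 0.9) -> dict:
    """Build the JSON body expected by POST /ask (fields from the example above)."""
    return {
        "tag": tag,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

# Then post it, e.g.:
#   requests.post('http://127.0.0.1:8000/ask',
#                 json=build_ask_payload('gpt', 'Tell me a joke.'),
#                 timeout=15)
```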
`prompt` may also be a Conversation-formatted dict, i.e. the output of `llm_conversation.Conversation.to_dict()` (see llm-conversation).
{
"tag": "Phi4",
"prompt": "Describe the image.",
"images": [llm_server.encode_image(<path to image>)],
"max_tokens": 64,
}
`temperature` arg currently not supported for Phi-4-multimodal-instruct. `images` accepts base64-encoded PNG strings for multimodal models. The encoder is provided by `llm_server`.
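The exact behavior of `llm_server.encode_image` isn't shown here; a rough stand-in using only the standard library, assuming it simply base64-encodes the raw file bytes:

```python
import base64

def encode_image_b64(path: str) -> str:
    """Base64-encode an image file's bytes as an ASCII string.

    A hypothetical stand-in for llm_server.encode_image; the real helper
    may differ (e.g. add a data-URI prefix or re-encode the image).
    """
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```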
