MAX is a high-performance inference server that provides an OpenAI-compatible endpoint for serving large language models (LLMs) locally or in the cloud. To start your own serving endpoint with just a few commands, check out our quickstart guide.
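Because the endpoint is OpenAI-compatible, existing OpenAI client code can talk to it with only a base-URL change. Here's a minimal sketch using the `openai` Python package; the host, port, and model name are assumptions, so substitute whatever address and model your own server was started with:

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running MAX endpoint.
# The base URL below assumes the server is listening on localhost:8000;
# adjust it to match your deployment.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # a local server typically ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="your-served-model",  # placeholder: use the model name you launched the server with
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```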
The inference server is built in Python and the source is available in our GitHub repo, but we're currently not accepting contributions to the serving library.
Use of MAX and Mojo is subject to the terms of the Modular Community License.