Running vLLM locally: what is the best model? #16390

@wes-kay

Question

On my 5090, I've tested two models so far with locally hosted vLLM:

  1. Qwen3-32B-AWQ
  • Will actually call bash commands and CRUD files
  • Spends ~20k tokens trying to install Flutter dependencies; it's 50/50 whether it will even build a test project.
  2. Qwen2.5-Coder-32B-Instruct-AWQ
  • Does not run any bash or CRUD
  • Will only output .json

1. Tool-Use Capability

Which open-weight models reliably support tool calling (bash execution, filesystem CRUD, API calls) when served through vLLM?
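For clarity on what "tool calling" means here: the agent sends an OpenAI-style tools schema with each chat request, and a model counts as supporting tool calls under vLLM if it responds with a structured tool_calls field rather than plain text. A minimal sketch of such a request payload (the `bash` tool definition is a hypothetical example, not opencode's actual schema):

```python
import json

# OpenAI-style function tool definition; an agent framework sends a list
# of these with every chat completion request. The "bash" tool here is
# an illustrative assumption, not a real opencode schema.
bash_tool = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Execute a shell command and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Command to run"},
            },
            "required": ["command"],
        },
    },
}

# The request body a client would POST to vLLM's OpenAI-compatible
# /v1/chat/completions endpoint.
payload = {
    "model": "Qwen/Qwen3-32B-AWQ",
    "messages": [{"role": "user", "content": "List the files in the repo"}],
    "tools": [bash_tool],
}

# Payload must be JSON-serializable to go over the wire.
body = json.dumps(payload)
```

A model that "reliably supports tool calling" returns `tool_calls` entries matching this schema; one that doesn't will tend to dump JSON into the plain text content instead, which matches the Qwen2.5-Coder behavior described above.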

2. Model Behavior Differences

Why would Qwen3-32B-AWQ execute shell commands while Qwen2.5-Coder-32B-Instruct-AWQ outputs only structured JSON responses? Is there a config change that would let Qwen Coder work with opencode properly?
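One thing worth checking before blaming the model: vLLM does not parse tool calls out of model output unless tool parsing is enabled at serve time, so a model can "know" the tool format yet still surface it as raw JSON text. A sketch of the serve command, assuming the `hermes` parser (the one the Qwen docs recommend for Qwen2.5-style tool calls) and an illustrative context length:

```shell
# Serve Qwen2.5-Coder with tool-call parsing enabled so clients like
# opencode receive structured tool_calls instead of raw JSON in the
# message content. --max-model-len is an assumed value; tune for VRAM.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768
```

If these flags were already set and Coder still only emits JSON, the difference is more likely the model's tool-use training than the server config.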

3. Best Models for Local Coding Agents

Which locally runnable models under roughly 40B parameters (5090, 32 GB) currently perform best with opencode for coding agents that must plan tasks, edit files, and execute shell commands?

4. Quantization Impact

Does AWQ quantization significantly degrade planning, reasoning, or tool-calling ability in coding models compared with FP16 or GPTQ?
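For context on why quantization is unavoidable in this setup at all, a rough weights-only memory estimate (ignoring KV cache, activations, and runtime overhead, which add several more GB):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weights-only footprint in GB; ignores KV cache and overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 32B parameters at FP16 (16 bits/weight) vs 4-bit AWQ.
fp16_gb = weight_memory_gb(32, 16)  # far exceeds a 32 GB 5090
awq_gb = weight_memory_gb(32, 4)    # fits, leaving room for KV cache
```

So on a 32 GB card the real comparison is between quantization schemes (AWQ vs GPTQ vs FP8), not quantized vs FP16, since FP16 weights alone would need roughly 64 GB.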

Metadata

Labels

core (anything pertaining to core functionality of the application; opencode server stuff), docs
