# Inference Server

## Starting the server
Always use the `inference` entry point; never run `vllm serve` or `python -m vllm.entrypoints.openai.api_server` directly. The entry point runs `setup_vllm_env()`, which configures environment variables (LoRA, multiprocessing) before vLLM is imported.
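The import order is the whole point of the wrapper: the environment must be configured before the first vLLM import. A minimal sketch of that pattern follows; the variable names (`VLLM_ALLOW_RUNTIME_LORA_UPDATING`, `VLLM_WORKER_MULTIPROC_METHOD`) are only plausible examples for the LoRA and multiprocessing settings mentioned above, not necessarily what this entry point actually sets.

```python
import os


def setup_vllm_env() -> None:
    """Configure vLLM-related environment variables before vLLM is imported.

    The specific variables below are illustrative assumptions, not the
    entry point's actual implementation.
    """
    # Example: allow LoRA adapters to be loaded and swapped at runtime.
    os.environ.setdefault("VLLM_ALLOW_RUNTIME_LORA_UPDATING", "True")
    # Example: control how vLLM spawns its worker processes.
    os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")


def main() -> None:
    # Set the environment first, matching the entry point's ordering
    # requirement, then import vLLM.
    setup_vllm_env()
    import vllm  # noqa: F401  (imported only after the env is configured)
    # ... parse the TOML config / CLI overrides and start the server here ...


if __name__ == "__main__":
    main()
```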
```bash
# With a TOML config
uv run inference @ path/to/config.toml

# With CLI overrides
uv run inference --model.name Qwen/Qwen3-0.6B --model.max_model_len 2048 --model.enforce_eager

# Combined
uv run inference @ path/to/config.toml --server.port 8001 --gpu-memory-utilization 0.5
```
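Assuming the entry point ultimately serves vLLM's OpenAI-compatible API (as the warning against `vllm.entrypoints.openai.api_server` suggests), a quick smoke test against the port from the combined example above:

```python
import json
import urllib.request

# Port from the "Combined" example above; adjust to your --server.port.
BASE_URL = "http://localhost:8001"

# vLLM's OpenAI-compatible server lists the loaded models at /v1/models.
with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
    models = json.load(resp)

print([m["id"] for m in models.get("data", [])])
```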
## SLURM scheduling

## Related skills