Models
Qwen-Coder¶
- Qwen2.5-Coder-32B-Instruct
- HF: Qwen/Qwen2.5-Coder-32B-Instruct
- Hardware:
- 1x H100/H200
- --tool-call-parser hermes --enable-auto-tool-choice
- 2x H100/H200
- --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
- Notes: Good balance of size and performance. Single GPU capable.
- Qwen3-Coder-480B-A35B-Instruct (BF16)
- HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
- Hardware:
- 8x H200/H20
- --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Notes: Cannot serve full 262K context on single node. Reduce max-model-len or increase gpu-memory-utilization.
- Qwen3-Coder-480B-A35B-Instruct-FP8
- HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
- Hardware:
- 8x H200/H20
- --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Env: VLLM_USE_DEEP_GEMM=1
- Notes: Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors. DeepGEMM recommended.
- Qwen3-Coder-30B-A3B-Instruct (BF16)
- HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
- Hardware:
- 1x H100/H200
- --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Notes: Fits comfortably on single GPU. ~60GB model weight.
- 2x H100/H200
- --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Notes: For higher throughput/longer context.
- Qwen3-Coder-30B-A3B-Instruct-FP8
- HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
- Hardware:
- 1x H100/H200
- --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Env: VLLM_USE_DEEP_GEMM=1
- Notes: FP8 quantized, ~30GB model weight. Excellent for single GPU deployment.
GPT-OSS¶
- Notes: Requires vLLM 0.10.1+gptoss. Built-in tools via /v1/responses endpoint (browsing, Python). Function calling not yet supported. --async-scheduling recommended for higher perf (not compatible with structured output).
- GPT-OSS-20B
- HF: openai/gpt-oss-20b
- Hardware:
- 1x H100/H200
- --async-scheduling
- 1x B200
- --async-scheduling
- Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- GPT-OSS-120B
- HF: openai/gpt-oss-120b
- Hardware:
- 1x H100/H200
- --async-scheduling
- Notes: Needs --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM
- 2x H100/H200
- --tensor-parallel-size 2 --async-scheduling
- Notes: Set --gpu-memory-utilization <0.95 to avoid OOM
- 4x H100/H200
- --tensor-parallel-size 4 --async-scheduling
- 8x H100/H200
- --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
- 1x B200
- --async-scheduling
- Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- 2x B200
- --tensor-parallel-size 2 --async-scheduling
- Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
GLM-4.5¶
- Notes: Listed configs support reduced context. For full 128K context, double the GPU count. Models default to thinking mode (disable with API param).
- GLM-4.5 (BF16)
- HF: zai-org/GLM-4.5
- Hardware:
- 16x H100
- --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 8x H200
- --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K: needs 32x H100 or 16x H200.
- GLM-4.5-FP8
- HF: zai-org/GLM-4.5-FP8
- Hardware:
- 8x H100
- --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 4x H200
- --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: For full 128K context: needs 16x H100 or 8x H200.
- GLM-4.5-Air (BF16)
- HF: zai-org/GLM-4.5-Air
- Hardware:
- 4x H100
- --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 2x H200
- --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: For full 128K context: needs 8x H100 or 4x H200.
- GLM-4.5-Air-FP8
- HF: zai-org/GLM-4.5-Air-FP8
- Hardware:
- 2x H100
- --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 1x H200
- --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: For full 128K context: needs 4x H100 or 2x H200.
Kimi¶
- Notes: Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128k context. Reuses DeepSeekV3 architecture with model_type="kimi_k2".
- Kimi-K2-Instruct
- HF: moonshotai/Kimi-K2-Instruct
- Hardware:
- 16x H200/H20
- --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
- Notes: Pure TP mode. For >16 GPUs, combine with pipeline-parallelism.
- 16x H200/H20 (DP+EP mode)
- --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
- Notes: Data parallel + expert parallel mode for higher throughput. Requires multi-node setup with proper networking.