Skip to content

Models

Qwen-Coder

  • Qwen2.5-Coder-32B-Instruct
  • HF: Qwen/Qwen2.5-Coder-32B-Instruct
  • Hardware:
    • 1x H100/H200
    • --tool-call-parser hermes --enable-auto-tool-choice
    • 2x H100/H200
    • --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
  • Notes: Good balance of size and performance. Single GPU capable.
  • Qwen3-Coder-480B-A35B-Instruct (BF16)
  • HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
  • Hardware:
    • 8x H200/H20
    • --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
    • Notes: Cannot serve full 262K context on single node. Reduce max-model-len or increase gpu-memory-utilization.
  • Qwen3-Coder-480B-A35B-Instruct-FP8
  • HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
  • Hardware:
    • 8x H200/H20
    • --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
    • Env: VLLM_USE_DEEP_GEMM=1
    • Notes: Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors. DeepGEMM recommended.
  • Qwen3-Coder-30B-A3B-Instruct (BF16)
  • HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
  • Hardware:
    • 1x H100/H200
    • --enable-auto-tool-choice --tool-call-parser qwen3_coder
    • Notes: Fits comfortably on single GPU. ~60GB model weight.
    • 2x H100/H200
    • --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
    • Notes: For higher throughput/longer context.
  • Qwen3-Coder-30B-A3B-Instruct-FP8
  • HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
  • Hardware:
    • 1x H100/H200
    • --enable-auto-tool-choice --tool-call-parser qwen3_coder
    • Env: VLLM_USE_DEEP_GEMM=1
    • Notes: FP8 quantized, ~30GB model weight. Excellent for single GPU deployment.

GPT-OSS

  • Notes: Requires vLLM 0.10.1+gptoss. Built-in tools via /v1/responses endpoint (browsing, Python). Function calling not yet supported. --async-scheduling recommended for higher perf (not compatible with structured output).
  • GPT-OSS-20B
  • HF: openai/gpt-oss-20b
  • Hardware:
    • 1x H100/H200
    • --async-scheduling
    • 1x B200
    • --async-scheduling
    • Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
  • GPT-OSS-120B
  • HF: openai/gpt-oss-120b
  • Hardware:
    • 1x H100/H200
    • --async-scheduling
    • Notes: Needs --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM
    • 2x H100/H200
    • --tensor-parallel-size 2 --async-scheduling
    • Notes: Set --gpu-memory-utilization <0.95 to avoid OOM
    • 4x H100/H200
    • --tensor-parallel-size 4 --async-scheduling
    • 8x H100/H200
    • --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
    • 1x B200
    • --async-scheduling
    • Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
    • 2x B200
    • --tensor-parallel-size 2 --async-scheduling
    • Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1

GLM-4.5

  • Notes: Listed configs support reduced context. For full 128K context, double the GPU count. Models default to thinking mode (disable with API param).
  • GLM-4.5 (BF16)
  • HF: zai-org/GLM-4.5
  • Hardware:
    • 16x H100
    • --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • 8x H200
    • --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  • Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K: needs 32x H100 or 16x H200.
  • GLM-4.5-FP8
  • HF: zai-org/GLM-4.5-FP8
  • Hardware:
    • 8x H100
    • --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • 4x H200
    • --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  • Notes: For full 128K context: needs 16x H100 or 8x H200.
  • GLM-4.5-Air (BF16)
  • HF: zai-org/GLM-4.5-Air
  • Hardware:
    • 4x H100
    • --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • 2x H200
    • --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  • Notes: For full 128K context: needs 8x H100 or 4x H200.
  • GLM-4.5-Air-FP8
  • HF: zai-org/GLM-4.5-Air-FP8
  • Hardware:
    • 2x H100
    • --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • 1x H200
    • --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  • Notes: For full 128K context: needs 4x H100 or 2x H200.

Kimi

  • Notes: Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128k context. Reuses DeepSeekV3 architecture with model_type="kimi_k2".
  • Kimi-K2-Instruct
  • HF: moonshotai/Kimi-K2-Instruct
  • Hardware:
    • 16x H200/H20
    • --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
    • Notes: Pure TP mode. For >16 GPUs, combine with pipeline-parallelism.
    • 16x H200/H20 (DP+EP mode)
    • --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
    • Notes: Data parallel + expert parallel mode for higher throughput. Requires multi-node setup with proper networking.