Build Arguments¶
The Dockerfile accepts seven build arguments that control the CUDA version, attention optimizations, and LLM support. All have sensible defaults targeting the newest GPU generations (Blackwell and Hopper).
Reference¶
CUDA_VERSION¶
NVIDIA CUDA base image version. Determines the CUDA toolkit used for compilation and runtime.
| Value | Target GPUs | Base Image |
|---|---|---|
| 12.8.1 (default) | Blackwell (B200, B100), Hopper (H200, H100) | nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 |
| 12.4.1 | Ampere (A100, A6000, RTX 30xx), Ada Lovelace (L40S, RTX 40xx) | nvidia/cuda:12.4.1-cudnn-devel-ubuntu24.04 |
| 12.1.1 | Turing (RTX 20xx, T4), Volta (V100) | nvidia/cuda:12.1.1-cudnn-devel-ubuntu24.04 |
Must be paired with the matching PYTORCH_INDEX.
PYTORCH_INDEX¶
PyTorch wheel index tag. Selects the correct pre-built PyTorch wheels for your CUDA version.
| Value | CUDA Version | PyTorch Install URL |
|---|---|---|
| cu128 (default) | 12.8.1 | https://download.pytorch.org/whl/cu128 |
| cu124 | 12.4.1 | https://download.pytorch.org/whl/cu124 |
| cu121 | 12.1.1 | https://download.pytorch.org/whl/cu121 |
This must always match CUDA_VERSION. Mismatched pairs will produce a broken image (PyTorch compiled for one CUDA version running on another).
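One way to catch a mismatch before the (long) build starts is a small pre-build check in a wrapper script. This is a hypothetical sketch, not part of the Dockerfile; the validate_pair helper and the environment-variable names are assumptions, and the valid pairings come from the table above.

```shell
# Hypothetical pre-build guard: reject CUDA_VERSION / PYTORCH_INDEX pairs
# that do not appear together in the pairing table.
validate_pair() {
  case "$1:$2" in
    12.8.1:cu128|12.4.1:cu124|12.1.1:cu121) return 0 ;;
    *) echo "mismatch: CUDA $1 with PyTorch index $2" >&2; return 1 ;;
  esac
}

# Fall back to the documented defaults when the variables are unset.
validate_pair "${CUDA_VERSION:-12.8.1}" "${PYTORCH_INDEX:-cu128}" || exit 1
```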
PYTHON_VERSION¶
Python interpreter version installed in the image.
| Value | Notes |
|---|---|
| 3.12 (default) | Recommended for ComfyUI as of 2026. Used in the venv at /opt/venv. |
Change only if a specific custom node requires a different Python version. The Dockerfile installs python${PYTHON_VERSION}, python${PYTHON_VERSION}-venv, and python${PYTHON_VERSION}-dev from the Ubuntu package repositories.
ENABLE_SAGE_ATTENTION¶
Install SageAttention and Triton for optimized attention computation.
| Value | Behavior |
|---|---|
| true (default) | Installs triton and sageattention via pip |
| false | Skips installation entirely |
GPU requirements:
| GPU Generation | Compute Capability | SageAttention Version |
|---|---|---|
| Hopper, Blackwell | SM 90+ | v2 with FP8 kernels -- fastest, uses 8-bit floating point for attention |
| Ampere, Ada Lovelace | SM 80-89 | v1 -- optimized CUDA kernels, 2-3x faster than default attention |
| Turing, Volta | SM 70-75 | Not supported -- set to false |
SageAttention provides 2-3x faster attention computation during image and video generation. It installs from pre-built wheels, so there is minimal impact on build time.
Set to false for SM 75 and below
On Turing (RTX 20xx, T4) and Volta (V100) GPUs, SageAttention will fail at runtime. Always set ENABLE_SAGE_ATTENTION=false for these GPUs.
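Since both attention optimizations share the same SM 80 cutoff, the flag value can be derived from the compute capability rather than set by hand. A minimal sketch (the attention_flag helper is hypothetical, not shipped with the Dockerfile):

```shell
# Hypothetical helper: map a compute capability (e.g. 90 for H100, 75 for T4)
# to the ENABLE_SAGE_ATTENTION value suggested by the table above.
# The same cutoff applies to ENABLE_FLASH_ATTENTION.
attention_flag() {
  if [ "$1" -ge 80 ]; then
    echo true    # SM 80+: SageAttention / FlashAttention supported
  else
    echo false   # SM 75 and below: disable to avoid runtime failures
  fi
}
```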
ENABLE_FLASH_ATTENTION¶
Install FlashAttention for memory-efficient fused attention kernels.
| Value | Behavior |
|---|---|
| true (default) | Installs flash-attn from source via pip (--no-build-isolation) |
| false | Skips installation entirely |
GPU requirements:
| GPU Generation | Compute Capability | FlashAttention Version |
|---|---|---|
| Hopper, Blackwell | SM 90+ | v3 -- newest, optimized for Hopper architecture |
| Ampere, Ada Lovelace | SM 80-89 | v2 -- memory-efficient fused kernels |
| Turing, Volta | SM 70-75 | Not supported -- set to false |
FlashAttention lets you run larger batch sizes or higher resolutions without running out of VRAM by reducing the memory footprint of attention computation.
Builds from source -- 20-30 minutes
Unlike SageAttention, FlashAttention builds from source during pip install. This adds 20-30 minutes to the Docker build. The CI/CD pipeline defaults to false for this reason.
Set to false for SM 75 and below
On Turing (RTX 20xx, T4) and Volta (V100) GPUs, FlashAttention is not supported. Always set ENABLE_FLASH_ATTENTION=false for these GPUs.
ENABLE_LLM¶
Compile llama.cpp server for local LLM inference on the pod GPU.
| Value | Behavior |
|---|---|
| true (default) | Clones llama.cpp, compiles llama-server with CUDA, installs binary to /opt/llama-server |
| false | Skips the entire llama.cpp clone and compilation step |
The binary supports GPU-accelerated inference with GGUF models (Qwen, Llama, Mistral, etc.) and is managed through the ComfyUI Studio LLM page.
Build time impact
llama-server compilation takes ~10 minutes when a GPU is present during the build (single architecture) or ~60 minutes on CI runners without a GPU (compiles for all architectures SM 75 through SM 100). Set to false to save this time if you do not need LLM support.
See llama-server for details on the compilation process.
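The architecture-dependent build time can be sketched roughly as below. This is illustrative only: the exact cmake flags are an assumption, and the Dockerfile (see the llama-server page) is authoritative.

```shell
# Illustrative only -- the real flags live in the Dockerfile.
# CI path (no GPU detected): compile CUDA kernels for every supported
# architecture, SM 75 through SM 100, which is what takes ~60 minutes.
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="75;80;86;89;90;100"
cmake --build build --target llama-server -j"$(nproc)"

# With a GPU present, narrowing CMAKE_CUDA_ARCHITECTURES to the single
# detected architecture is what cuts the build to ~10 minutes.
```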
LLAMA_CPP_VERSION¶
The llama.cpp git tag to clone and build. Only relevant when ENABLE_LLM=true.
| Value | Behavior |
|---|---|
| b8505 (default) | Pinned release, tested and known to build correctly. Pinned on 2026-03-24. |
| latest | Clones HEAD of the llama.cpp repository (--depth 1, no tag) |
| Any tag (e.g., b8400) | Clones that specific release tag |
Pin for reproducibility
The default pinned version ensures reproducible builds. Using latest gets the newest features and fixes but risks build failures from upstream breaking changes. If a latest build fails, switch back to the pinned version or specify a known-good tag.
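The tag-selection behavior described above can be sketched as a small helper. The clone_args function is hypothetical (the Dockerfile is authoritative), but it mirrors the documented rule: latest means a shallow clone of HEAD, anything else is treated as a release tag.

```shell
# Hypothetical sketch of how LLAMA_CPP_VERSION selects git clone options.
# Usage: git clone $(clone_args "$LLAMA_CPP_VERSION") <llama.cpp repo URL>
clone_args() {
  if [ "$1" = "latest" ]; then
    echo "--depth 1"                # HEAD of the repository, no tag
  else
    echo "--depth 1 --branch $1"    # a specific release tag, e.g. b8505
  fi
}
```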
Build Examples¶
Default (Blackwell/Hopper, all features)¶
Uses all defaults: CUDA 12.8.1, cu128, SageAttention v2, FlashAttention v3, llama-server b8505.
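The command, in the same form as the other examples on this page, with no build arguments overridden:

```shell
docker build -f docker/production/Dockerfile -t comfyui-studio .
```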
Ampere / Ada Lovelace (A100, RTX 30xx/40xx, L40S)¶
```shell
docker build -f docker/production/Dockerfile -t comfyui-studio \
  --build-arg CUDA_VERSION=12.4.1 \
  --build-arg PYTORCH_INDEX=cu124 \
  .
```
SageAttention v1 and FlashAttention v2 are installed (both default to true).
Turing / Volta (RTX 20xx, T4, V100)¶
```shell
docker build -f docker/production/Dockerfile -t comfyui-studio \
  --build-arg CUDA_VERSION=12.1.1 \
  --build-arg PYTORCH_INDEX=cu121 \
  --build-arg ENABLE_SAGE_ATTENTION=false \
  --build-arg ENABLE_FLASH_ATTENTION=false \
  .
```
No attention optimizations. xformers and torch.compile still work on these GPUs.
Fast CI build (no FlashAttention, no LLM)¶
```shell
docker build -f docker/production/Dockerfile -t comfyui-studio \
  --build-arg ENABLE_FLASH_ATTENTION=false \
  --build-arg ENABLE_LLM=false \
  .
```
Skips both source compilations. Build time drops from ~85-115 minutes to ~15-25 minutes.
Custom llama.cpp version¶
```shell
docker build -f docker/production/Dockerfile -t comfyui-studio \
  --build-arg LLAMA_CPP_VERSION=b8400 \
  .
```
Latest llama.cpp (bleeding edge)¶
```shell
docker build -f docker/production/Dockerfile -t comfyui-studio \
  --build-arg LLAMA_CPP_VERSION=latest \
  .
```
Argument Pairing Quick Reference¶
| GPU Family | CUDA_VERSION | PYTORCH_INDEX | SAGE_ATTENTION | FLASH_ATTENTION |
|---|---|---|---|---|
| Blackwell / Hopper | 12.8.1 | cu128 | true (v2 FP8) | true (v3) |
| Ampere / Ada Lovelace | 12.4.1 | cu124 | true (v1) | true (v2) |
| Turing / Volta | 12.1.1 | cu121 | false | false |
See GPU Compatibility for the full per-GPU breakdown.